Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> It didn't take me very long to rebase the code well enough to make
> the regression tests pass, but the isolation tests are failing, so
> I think I've mucked up the predicate locking or something
> somewhere in here. Anyhow, that led me to a careful comparison of
> the logic in the two above-named functions, which as you might
> expect have drifted apart slightly.
> In index_getnext()
> I can't help noticing that the code no longer matches the comment
> which precedes it.
Good point. That obviously needs to be fixed.
> Apparently we can skip examining any more members only when SSI is
> not in use
More precisely, it depends on whether the transaction which is
executing here is serializable; but yeah. (Some of our quick
returns are based on whether *any* serializable transaction is
active, and I didn't want anyone to mistake the test you're
describing for that one.)
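To make the distinction concrete, here's a toy sketch of the two
different tests (not the actual predicate.c code; the variable names
are stand-ins for the real checks against XactIsoLevel and the SSI
shared-memory state):

#include <stdbool.h>

static bool my_xact_is_serializable;   /* is *this* xact SERIALIZABLE? */
static bool any_serializable_active;   /* is *any* such xact running?  */

/* The test you're describing: whether *we* take predicate locks
 * while reading depends only on our own isolation level. */
static bool
need_predicate_lock_for_read(void)
{
    return my_xact_is_serializable;
}

/* The other kind of quick return: even a non-serializable writer
 * must check for rw-conflicts if any serializable xact is active. */
static bool
need_conflict_check_for_write(void)
{
    return any_serializable_active;
}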
> probably because we need to predicate-lock the remaining tuples in
> the HOT chain.
I had to think about this a while. If we see a tuple which *is*
visible to us, we need to acquire tuple locks on that and any later
versions of the tuple. When going through the index, we won't get
another shot at later HOT versions of the tuple, so we have to find
them now. If we can avoid it we should not lock any tuples
representing an *earlier* version of the row than we can see, or any
tuples from rows with no visible versions; unnecessarily locking
such tuples could reduce performance but would not allow anomalies.
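To put that rule in code form, here's a self-contained toy model
(emphatically not the heapam.c code -- in the real tree the tuple
locks go through PredicateLockTuple() -- just the shape of the walk):

#include <stdbool.h>
#include <stdio.h>

typedef struct Tuple
{
    int           version;  /* position in the HOT chain */
    bool          visible;  /* visible to our snapshot? */
    struct Tuple *next;     /* next (newer) chain member */
} Tuple;

static void
predicate_lock(Tuple *t)
{
    printf("predicate-lock version %d\n", t->version);
}

/* Lock the first visible tuple and every later version; skip
 * earlier versions and chains with no visible member. */
static void
walk_hot_chain(Tuple *chain)
{
    bool seen_visible = false;

    for (Tuple *t = chain; t != NULL; t = t->next)
    {
        if (t->visible)
            seen_visible = true;
        if (seen_visible)
            predicate_lock(t);
    }
}

int
main(void)
{
    Tuple v3 = {3, false, NULL};    /* newer version, not yet visible */
    Tuple v2 = {2, true,  &v3};     /* the version our snapshot sees  */
    Tuple v1 = {1, false, &v2};     /* dead earlier version           */

    walk_hot_chain(&v1);            /* locks versions 2 and 3 only    */
    return 0;
}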
> the logic in heap_hot_search_buffer() is slightly different
> Here again we're refusing to bypass the remainder of the HOT chain
> when the visible tuple is found, if SSI is in use. But the test
> is slightly different. Here it applies whenever
> IsolationIsSerializable(), whereas in index_getnext() it applies
> only when IsolationIsSerializable() *and* we're using an MVCC
> snapshot.
> Does SSI care about non-MVCC snapshots?
We should only be acquiring a predicate lock on a tuple for the
normal MVCC snapshot used by the serializable transaction. Perhaps
this is a test we should add to our "quick return" macros checked on
entry to most predicate.c functions; it seems burdensome to force
all callers to know such details. In index_getnext() it was doing
the check already, which is fine, but you're right -- we should be
checking this consistently to avoid false positives.
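Something along these lines, say (the names here are invented for
illustration; the real tests I have in mind are
IsolationIsSerializable() and IsMVCCSnapshot()):

#include <stdbool.h>

typedef struct Snapshot
{
    bool is_mvcc;       /* stand-in for the real IsMVCCSnapshot() */
} Snapshot;

static bool xact_is_serializable;   /* stand-in for IsolationIsSerializable() */

/* Fold both conditions into one quick-return test, so callers like
 * index_getnext() and heap_hot_search_buffer() need not know the
 * details. */
static bool
serialization_needed_for_read(Snapshot *snap)
{
    return xact_is_serializable && snap->is_mvcc;
}

/* On entry to each predicate.c function that takes read locks:
 *
 *     if (!serialization_needed_for_read(snapshot))
 *         return;
 */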
> Another thing I notice is that in heap_hot_search_buffer(), when
> this SSI-related change applies, we go through and visit all the
> remaining members of the HOT chain and then return the last (and
> presumably only) tuple we encountered.
That return value sounds like a bug. It certainly seems like it
should return the tuple which first caused it to set match_found =
true; that is to say, the same one which would be returned for a
non-serializable transaction.
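Reusing the toy Tuple type and predicate_lock() from the sketch
above, the intended shape is roughly this (again just an
illustration, not the real heap_hot_search_buffer() logic):

/* Remember the first visible tuple; for a serializable transaction,
 * keep walking only to lock it and each later version, then return
 * the remembered tuple -- the same one a non-serializable
 * transaction would get. */
static Tuple *
hot_search_sketch(Tuple *chain, bool serializable)
{
    Tuple *match = NULL;

    for (Tuple *t = chain; t != NULL; t = t->next)
    {
        if (match == NULL && t->visible)
            match = t;              /* this is the tuple to return */
        if (match != NULL)
        {
            if (!serializable)
                break;              /* non-SSI case can stop here */
            predicate_lock(t);      /* lock match and later versions */
        }
    }
    return match;                   /* not the last chain member */
}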
> But in index_getnext(), we return the tuple we found immediately
> and then continue walking the HOT chain on the next call. This
> difference seems possibly significant. If it's necessary for
> correctness that we predicate-lock the invisible tuples, then what
> happens if the scan stops here due to, say, a LIMIT?
You're right -- it should go to the end of the chain for both. In
general a LIMIT can stop the acquisition of predicate locks because
you only need to lock rows which were *seen*. But at some point
after I added these calls it became clear that tuples representing
later versions of a visible and visited row also need to be locked.
I obviously didn't find all the places to fix when I had that
epiphany.
But that raises another question. The HOT updates are easy, because
they're on a page we're already visiting. The page is pinned and
locked already, so chasing down the HOT chain on a serializable
transaction is relatively cheap. What about non-HOT updates,
though? I'm pretty sure we handle those OK for most cases, but a
LIMIT which stops us after reading a visible-but-updated tuple and
before having read the new version of it could lead to a false
negative -- an undetected anomaly. I think this could also be a
factor for indexed min/max functions, semi-joins, and anti-joins.
It's easy to see how this one got missed in testing: it takes at
least four separate transactions, overlapping with just the right
timing, to hit the "next generation" type of anomaly, and on top of
that a less frequently used feature has to turn up at a particular
place in that set of transactions. Nonetheless, it is a bug and
should be fixed.
I think we need to do something about that, but it's not immediately
obvious to me what the best solution is. Chasing down the pointers
to subsequent versions of the row hardly seems sane. I would hate
to have to unconditionally escalate to a heap relation lock upon
seeing a visible-but-updated tuple, but I don't know whether it's
feasible to detect when we're in a scan which can complete early for
one of the aforementioned reasons.
Any other ideas on that one?
Anyway, I could clean up all but that last issue in the old code.
I'm not sure whether that makes sense if you're refactoring it
anyway. Would you like me to look at the refactored code to suggest
fixes? Would you rather do it yourself based on my answers here?
Do we need to sort out that last issue before proceeding on the
others?
-Kevin