Dan Ports <drkp(at)csail(dot)mit(dot)edu> wrote:
> While running some benchmarks to test SSI performance, I found a
> race condition that's capable of causing a segfault. A patch is
> attached.
>
> The bug is in CheckTargetForConflictsIn, which scans the list of
> SIREAD locks on a lock target when it's modified. There's an
> optimization in there where the writing transaction will remove a
> SIREAD lock that it holds itself, because it's being replaced with
> a (stronger) write lock. To do that, it needs to drop its shared
> lwlocks and reacquire them in exclusive mode. The existing code
> deals with concurrent modifications in that interval by redoing
> checks. However, it misses the case where some other transaction
> removes all remaining locks on the target, and proceeds to remove
> the lock target itself.
>
> The attached patch fixes this by deferring the SIREAD lock removal
> until the end of the function. At that point, there isn't any need
> to worry about concurrent updates to the target's lock list. The
> resulting code is also simpler.
I just want to confirm that this addresses a real (although very
narrow) race condition. It results from code used to implement a
valuable optimization described in Cahill's thesis: don't get or
keep an SIREAD lock on a row which has a write lock. The write lock
is stronger and will cause a write/write conflict with any
overlapping transactions which would care about a read/write
conflict. The pattern of reading a row and then updating or
deleting it is so common that this optimization does a lot to avoid
promotion of predicate locks to coarser granularity, and thereby
helps avoid false positives.
While the optimization is valuable, the code used to implement it
was pretty horrid. (And that was me that wrote it.) It has already
been the cause of several other fixes since the main patch went in.
What Dan has done here is move the optimization out of the middle of
the loop which is doing the conflict detection, and in doing so has
reduced the number of lines of code needed, reduced the amount of
fiddling with LW locks, and all around made the code more robust and
more understandable.
I've reviewed the patch and it looks good to me. Dan has beat up on
it with the same DBT-2 run which exposed the race condition without
seeing a problem. Although a much smaller patch could address the
immediate problem, I strongly feel that Dan has taken the right
approach by refactoring this bit to something fundamentally cleaner
and less fragile.
-Kevin