Re: Low hanging fruit in lazy-XID-assignment patch?

From: "Florian G(dot) Pflug" <fgp(at)phlo(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Low hanging fruit in lazy-XID-assignment patch?
Date: 2007-09-07 19:25:48
Message-ID: 46E1A5BC.50900@phlo.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Tom Lane wrote:
> I've spent the past hour or so trying to consolidate the comments in
> GetSnapshotData and related places into a single chunk of text to be
> added to src/backend/access/transam/README. Attached is what I have so
> far --- this incorporates the idea of not taking ProcArrayLock to exit
> an XID-less transaction, but not yet Florian's idea. I think it'd get
> simpler if we changed to that, but am posting this for comments.

> Interlocking transaction begin, transaction end, and snapshots
> --------------------------------------------------------------
>
> We try hard to minimize the amount of overhead and lock contention involved
> in the frequent activities of beginning/ending a transaction and taking a
> snapshot. Unfortunately, we must have some interlocking for this, because
> it is critical that all backends agree on the commit order of transactions.
This is actually still a slightly stronger requirement than what we really
need I think.

To simplify the following discussion, "A -> B" shall mean that transaction A saw
B as committed. Conversely, "A !> B" shall mean that A treats B as in-progress.
If A was in read-committed mode, the visibility refers to the latest snapshot
that A used.

Now assume A and B commit at nearly the same time, and for two other
transactions C and D the following holds:
C -> A, C !> B but
D !> A, D -> B.

This would violate the requirement that the commit order is globally
agreed upon, yet as long as both A !> B and B !> A holds, there is no
conflict. (Note that if A and B are serializable, A !> B && B !> A
implies that A and B cannot have touched the same record and have
both committed - one would have been aborted due to a
SerializationError).

I must admit, though, that this is a quite academic case, since the
prerequisite A !> B and B !> A is something we have no control over
for read-committed transactions - who knows when they might have taken
their last snapshot...

Still, I wanted to mention this because I believe that the minimal
requirement that we actually *need* to enforce is
A -> B and B -> C imply A -> C. (T1)

The actual implementation will probably always have to enforce
something slightly stronger, but it's still nice to know the
minimal guarantee needed to be able to judge correctness.

> For example, suppose an UPDATE in xact A is blocked by xact B's prior
> update of the same row, and xact B is doing commit while xact C gets a
> snapshot. Xact A can complete and commit as soon as B releases its locks.
> If xact C's GetSnapshotData sees xact B as still running, then it had
> better see xact A as still running as well, or it will be able to see two
> tuple versions - one deleted by xact B and one inserted by xact A.
In my notation this becomes: A -> B and C !> B implies C !> A.

This then follows from (T1) - Assume that A -> B, C !> B but C -> A,
then with (A) C -> B follows from C -> A and A -> B, which contradicts
C !> B.

> We
> enforce this by not allowing any transaction to exit the set of running
> transactions while a snapshot is being taken. (This rule is probably
> stronger than necessary, but see the next problem.) The implementation
> of this is that GetSnapshotData takes the ProcArrayLock in shared mode
> (thereby allowing multiple backends to take snapshots in parallel), but
> xact.c must take the ProcArrayLock in exclusive mode while clearing
> MyProc->xid at transaction end (either commit or abort).
Agreed. We *do* enforce a strict global ordering of committs and snapshots.
This then guarantees (T1) because if A -> B and B -> C, than A *must*
have taken it's snapshot after B committed, and B in turn *must* have
taken it's snapshot after C committed, so surely A -> C will hold too.

> Here is another variant of the risk scenario:
>
> 1. Xact A is running (in Read Committed mode).
> 2. Xact C's GetSnapshotData reads next transaction ID into xmax, then is
> swapped out before it can acquire ProcArrayLock.
> 3. Xact B gets new XID (>= C's xmax), makes changes and commits.
> 4. Xact A changes some row R changed by xact B and commits.
> 5. Xact C finishes getting its snapshot data. It sees xact A as done,
> but sees xact B as still running (since B >= xmax).
>
> Now C will see R changed by xact B and then xact A, *but* does not see
> other changes made by xact B. If C is supposed to be in Serializable mode,
> this is wrong.
I never really grasped why we need to assume serializable here - this seems
wrong if C is read-committed too. Seeing only half of a transaction's changes
can never be right, can it?

> To prevent this it is necessary that GetSnapshotData acquire ProcArrayLock
> before it calls ReadNewTransactionId. This prevents xact A from exiting
> the set of running transactions seen by xact C. Therefore both A and B
> will be seen as still running => no inconsistency.
Another point of view is that determining the xmax of a snapshot really *is*
part of taking the snapshot. Since we already obtained that we need to
serialize snapshotting and committing, it follows that we must not allow
committs to happen between the time we take the xmax and the time we
complete the snapshot.

> In short, then, the rule is that no transactions may exit the set of
> currently-running transactions between the time we fetch xmax and the time
> we finish building our snapshot. However, this restriction only applies
> to transactions that have an XID --- read-only transactions can end without
> acquiring ProcArrayLock, since they don't affect anyone else's snapshot.
Yes, they completely fall out of the picture, because nobody ever cares if
they even committed.

> We must also require GetNewTransactionId to store the new XID into the
> shared ProcArray before releasing XidGenLock. This ensures that when
> GetSnapshotData calls ReadNewTransactionId (which also takes XidGenLock),
> all active XIDs before the returned value of nextXid are already present in
> the ProcArray and can't be missed by GetSnapshotData. Unfortunately, we
> can't have GetNewTransactionId take ProcArrayLock to do this, else it could
> deadlock against GetSnapshotData. Therefore, we simply let
> GetNewTransactionId store into MyProc->xid without any lock. We are
> thereby relying on fetch/store of an XID to be atomic, else other backends
> might see a partially-set XID. (NOTE: for multiprocessors that need
> explicit memory access fence instructions, this means that
> acquiring/releasing XidGenLock is just as necessary as acquiring/releasing
> ProcArrayLock for GetSnapshotData to ensure it sees up-to-date xid fields.)
> This also means that readers of the ProcArray xid fields must be careful to
> fetch a value only once, rather than assume they can read it multiple times
> and get the same answer each time.
In other words: It won't do for transaction to acquire an xid, then go into
hiding, and later suddenly reappear and start storing it's rather old xid
somewhere.
Or, again, differently put: When we take a snapshot we must be sure that any
xid < our xmax is either not in use anymore, or shows up in the proc array.

By storing the xid in the proc array while holding the XidGenLock, we
ensure that there is always is at most one xid assigned, but not showing
up in the proc array. This xid is always equal to either nextXid, or
nextXid-1. It is the "nextXid-1" case that forces us to take the XidGenLock
- it ensures that *no* xid is assigned but not showing up in the proc array.
If we used any older value for the xmax, the nextXid-1 case wouldn't hurt us -
the "hidden" xid will still be >= our xmax.

The last paragraph is in essence what guarantees the correctness of using
LargestCompletedXid instead of ReadNewTransactionId() as the xmax.

> Another important activity that uses the shared ProcArray is GetOldestXmin,
> which must determine a lower bound for the oldest xmin of any active MVCC
> snapshot, system-wide. Each individual backend advertises the smallest
> xmin of its own snapshots in MyProc->xmin, or zero if it currently has no
> live snapshots (eg, if it's between transactions or hasn't yet set a
> snapshot for a new transaction). GetOldestXmin takes the MIN() of the
> valid xmin fields. It does this with only shared lock on ProcArrayLock,
> which means there is a potential race condition against other backends
> doing GetSnapshotData concurrently: we must be certain that a concurrent
> backend that is about to set its xmin does not compute an xmin less than
> what GetOldestXmin returns. We ensure that by including all the active
> XIDs into the MIN() calculation, along with the valid xmins. The rule that
> transactions can't exit without taking exclusive ProcArrayLock ensures that
> concurrent holders of shared ProcArrayLock will compute the same minimum of
> currently-active XIDs: no xact, in particular not the oldest, can exit
> while we hold shared ProcArrayLock. So GetOldestXmin's view of the minimum
> active XID will be the same as that of any concurrent GetSnapshotData, and
> so it can't produce an overestimate. If there is no active transaction at
> all, GetOldestXmin returns the result of ReadNewTransactionId. Note that
> two concurrent executions of GetOldestXmin might not see the same result
> from ReadNewTransactionId --- but if there is a difference, the intervening
> execution(s) of GetNewTransactionId must have stored their XIDs into the
> ProcArray, so the later execution of GetOldestXmin will see them and
> compute the same global xmin anyway.
Agreed. A shorter reasoning would maybe be:

We need to ensure that no snapshot that is still in use has a xmin smaller
than what GetOldestXmin() returns. We have to check two cases, snapshot
creation and transaction exit.

During snapshot creation, we compute the xmin as MIN() of all xids in our
snapshot. A concurrent GetOldestXmin() or GlobalXmin computation cannot derive a
larger value, since it takes the published xids into account, as well as the
published xmins. This, together with the rule that no transaction can remove
it's xid from the set of running transactions while we either take a snapshot or
do GetOldestXmin() guarantees correctness.

During transaction exit, we don't need to take any lock to clear our xmin.
The xmin won't influence the xmin calculations of other transactions, only
GlobalXmin and GetOldestXmin() computations. Since we won't use our snapshot
anymore, we don't care if they take our xmin into account for that or not.

> GetSnapshotData also performs an oldest-xmin calculation (which had better
> match GetOldestXmin's) and stores that into RecentGlobalXmin, which is used
> for some tuple age cutoff checks where a fresh call of GetOldestXmin seems
> too expensive. Note that while it is certain that two concurrent
> executions of GetSnapshotData will compute the same xmin for their own
> snapshots, as argued above, it is not certain that they will arrive at the
> same estimate of RecentGlobalXmin. This is because we allow XID-less
> transactions to clear their MyProc->xmin asynchronously (without taking
> ProcArrayLock), so one execution might see what had been the oldest xmin,
> and another not. This is OK since RecentGlobalXmin need only be a valid
> lower bound. As noted above, we are already assuming that fetch/store
> of the xid fields is atomic, so assuming it for xmin as well is no extra
> risk.
Agreed - see above for a equivalent description

I had a rather hard time understanding these things initially - at least
for me, the enlightenment came when I realized that most of the locking
in there actually ensures that (T1) holds - or in other words, that
the relation "A sees B as committed" *has* to be a transitive one.

So I'd like to see this reasoning in that README - though maybe I'm the
only one to whom this is the clearest wording...

greetings, Florian Pflug

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2007-09-07 19:59:42 Re: Low hanging fruit in lazy-XID-assignment patch?
Previous Message Pavel Stehule 2007-09-07 19:13:52 integrated tsearch doesn't work with non utf8 database