From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Reduce ProcArrayLock contention
Date: 2015-08-19 15:39:01
Message-ID: 20150819153901.GC10770@awork2.anarazel.de
Lists: pgsql-hackers
Hi,
> On Wed, Aug 5, 2015 at 10:59 AM, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> OK, committed.
I spent some time today reviewing the committed patch. So far my only
major complaint is that I think the comments document the approach taken
insufficiently: stuff like avoiding ABA-type problems by clearing the
list entirely, and it being impossible that entries end up on the list
too early, absolutely needs to be documented explicitly.
I think I found a few minor correctness issues, mostly around the fact
that we have, so far, tried to use semaphores in a way that copes with
unrelated unlocks "arriving". I actually think I removed all of the
locations that caused that to happen, but I don't think we should rely
on that fact. Looking at the following two pieces of code:
	/* If the list was not empty, the leader will clear our XID. */
	if (nextidx != INVALID_PGPROCNO)
	{
		/* Sleep until the leader clears our XID. */
		while (pg_atomic_read_u32(&proc->nextClearXidElem) != INVALID_PGPROCNO)
		{
			extraWaits++;
			PGSemaphoreLock(&proc->sem);
		}

		/* Fix semaphore count for any absorbed wakeups */
		while (extraWaits-- > 0)
			PGSemaphoreUnlock(&proc->sem);
		return;
	}
...
	/*
	 * Now that we've released the lock, go back and wake everybody up.  We
	 * don't do this under the lock so as to keep lock hold times to a
	 * minimum.  The system calls we need to perform to wake other processes
	 * up are probably much slower than the simple memory writes we did while
	 * holding the lock.
	 */
	while (wakeidx != INVALID_PGPROCNO)
	{
		PGPROC	   *proc = &allProcs[wakeidx];

		wakeidx = pg_atomic_read_u32(&proc->nextClearXidElem);
		pg_atomic_write_u32(&proc->nextClearXidElem, INVALID_PGPROCNO);

		if (proc != MyProc)
			PGSemaphoreUnlock(&proc->sem);
	}
There's a bunch of issues with those two blocks afaics:

1) The first block (in one backend) might read INVALID_PGPROCNO before
   ever locking the semaphore, if the leader in a second backend writes
   INVALID_PGPROCNO quickly enough. The leader unlocks the semaphore
   regardless, so it ends up permanently out of "balance".
2) There's no memory barriers around dealing with nextClearXidElem in
the first block. Afaics there needs to be a read barrier before
returning, otherwise it's e.g. not guaranteed that the woken up
backend sees its own xid set to InvalidTransactionId.
3) If a follower returns before the leader has actually finished waking
   that respective backend up, we can get into trouble:

   Consider what happens if such a follower enqueues in another
   transaction. It is not, as far as I could find out, guaranteed on all
   types of CPUs that a third backend can actually see nextClearXidElem
   as INVALID_PGPROCNO. That'd likely require SMT/HT cores and multiple
   sockets: if the write to nextClearXidElem is entered into the local
   store buffer (leader #1), a hyper-threaded process (leader #2) can
   possibly see it (store forwarding) while another backend doesn't yet.

   I think this is very unlikely to be an actual problem due to
   independent barriers until enqueued again, but I don't want to rely
   on it without documenting it. It seems safer to replace
+ wakeidx = pg_atomic_read_u32(&proc->nextClearXidElem);
+ pg_atomic_write_u32(&proc->nextClearXidElem, INVALID_PGPROCNO);
with a pg_atomic_exchange_u32().
I think to fix these, ProcArrayGroupClearXid() should use a protocol
similar to lwlock.c's. E.g. make the two blocks something like:
	while (wakeidx != INVALID_PGPROCNO)
	{
		PGPROC	   *proc = &allProcs[wakeidx];

		wakeidx = pg_atomic_read_u32(&proc->nextClearXidElem);
		pg_atomic_write_u32(&proc->nextClearXidElem, INVALID_PGPROCNO);

		/* ensure that all previous writes are visible before follower continues */
		pg_write_barrier();

		proc->lwWaiting = false;

		if (proc != MyProc)
			PGSemaphoreUnlock(&proc->sem);
	}
and
	if (nextidx != INVALID_PGPROCNO)
	{
		Assert(!MyProc->lwWaiting);

		for (;;)
		{
			/* acts as a read barrier */
			PGSemaphoreLock(&MyProc->sem);
			if (!MyProc->lwWaiting)
				break;
			extraWaits++;
		}

		Assert(pg_atomic_read_u32(&MyProc->nextClearXidElem) == INVALID_PGPROCNO);

		/* Fix semaphore count for any absorbed wakeups */
		while (extraWaits-- > 0)
			PGSemaphoreUnlock(&MyProc->sem);

		return;
	}
Going through the patch:
+/*
+ * ProcArrayGroupClearXid -- group XID clearing
+ *
+ * When we cannot immediately acquire ProcArrayLock in exclusive mode at
+ * commit time, add ourselves to a list of processes that need their XIDs
+ * cleared. The first process to add itself to the list will acquire
+ * ProcArrayLock in exclusive mode and perform ProcArrayEndTransactionInternal
+ * on behalf of all group members. This avoids a great deal of context
+ * switching when many processes are trying to commit at once, since the lock
+ * only needs to be handed from the last share-locker to one process waiting
+ * for the exclusive lock, rather than to each one in turn.
+ */
+static void
+ProcArrayGroupClearXid(PGPROC *proc, TransactionId latestXid)
+{
This comment, in my opinion, is rather misleading. If you have a
workload that primarily is slowed down due to transaction commits, this
patch doesn't actually change the number of context switches very
much. Previously all backends enqueued in the lwlock and got woken up
one-by-one. Except for backends 'jumping' the queue while the lock has
been released but the woken-up backend isn't running yet, there'll be
exactly as many context switches as today.

The difference is that only one backend has to actually acquire the
lock. So what has changed is the number of times the lock is actually
taken, and the total duration it is held, in exclusive mode.
+ /* Support for group XID clearing. */
+ volatile pg_atomic_uint32 nextClearXidElem;
+ /* First pgproc waiting for group XID clear */
+ volatile pg_atomic_uint32 nextClearXidElem;
Superfluous volatiles.
I don't think it's a good idea to use the same variable name in both
PROC_HDR and PGPROC; it's confusing.
How hard did you try checking whether this causes regressions? This
increases the number of atomics in the commit path a fair bit. I doubt
it's really bad, but it seems like a good idea to benchmark something
like a single full-throttle writer and a large number of readers.
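E.g. something like this (a command sketch only, not something I've run
here; assumes an already-initialized pgbench database named "bench" —
run it once with and once without the patch and compare tps):

```shell
# one full-throttle writer (default TPC-B-like script) ...
pgbench -n -c 1 -j 1 -T 60 bench &

# ... concurrently with many read-only clients (-S = SELECT-only script)
pgbench -n -c 64 -j 16 -S -T 60 bench

wait
```

The readers take snapshots (shared ProcArrayLock) while the writer's
commits now go through the extra atomics, so a regression there should
show up in the writer's tps.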
Greetings,
Andres Freund