From: Matthew Wakeling <matthew(at)flymine(dot)org>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Proposal of tunable fix for scalability of 8.4
Date: 2009-03-16 16:26:27
Message-ID: alpine.DEB.2.00.0903161543590.21772@aragorn.flymine.org
Lists: pgsql-performance
On Sat, 14 Mar 2009, Heikki Linnakangas wrote:
> I think the elephant in the room is that we have a single lock that needs to
> be acquired every time a transaction commits, and every time a backend takes
> a snapshot.
I like this line of thinking.
There are two valid sides to this. One is the elephant itself: can we remove
the need for this lock, or at least reduce its contention? The other is that
these tests have shown the locking code has room for improvement when many
processes are waiting on the same lock. Both could be worked on, but perhaps
the greatest benefit will come from stopping a single lock from being so
contended in the first place.
One possibility would be for the locks to alternate between exclusive and
shared - that is:
1. Take a snapshot of all shared waits, and grant them all - thundering
herd style.
2. Wait until ALL of them have finished, granting no more.
3. Take a snapshot of all exclusive waits, and grant them all, one by one.
4. Wait until all of them have finished, granting no more.
5. Back to (1).
This may also improve CPU cache coherency. Or of course, it may make
everything much worse - I'm no expert. It would avoid starvation, though.
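
For what it's worth, here is a minimal sketch of that alternation scheme,
built on pthreads rather than on the actual LWLock code - all the names are
invented, and for brevity it serves exclusive waiters until the queue is
empty rather than only those present when the phase switched:

#include <pthread.h>

/* Initialise with phase = 0, gate_open = 1, holders = excl_wait = 0, plus
 * PTHREAD_MUTEX_INITIALIZER and PTHREAD_COND_INITIALIZER. */
typedef struct
{
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    int             phase;      /* 0 = shared phase, 1 = exclusive phase */
    int             holders;    /* holders currently inside the lock */
    int             excl_wait;  /* queued exclusive requests */
    int             gate_open;  /* shared requests still being admitted? */
} alt_lock;

static void
alt_lock_shared(alt_lock *l)
{
    pthread_mutex_lock(&l->mutex);
    /* The shared batch is admitted together, thundering-herd style, until
     * an exclusive request closes the gate. */
    while (!(l->phase == 0 && l->gate_open))
        pthread_cond_wait(&l->cond, &l->mutex);
    l->holders++;
    pthread_mutex_unlock(&l->mutex);
}

static void
alt_lock_exclusive(alt_lock *l)
{
    pthread_mutex_lock(&l->mutex);
    l->excl_wait++;
    if (l->phase == 0)
    {
        l->gate_open = 0;          /* stop admitting new shared requests */
        if (l->holders == 0)
            l->phase = 1;          /* lock was idle: switch immediately */
    }
    /* Exclusive holders go one at a time. */
    while (!(l->phase == 1 && l->holders == 0))
        pthread_cond_wait(&l->cond, &l->mutex);
    l->excl_wait--;
    l->holders = 1;
    pthread_mutex_unlock(&l->mutex);
}

static void
alt_unlock(alt_lock *l)
{
    pthread_mutex_lock(&l->mutex);
    if (--l->holders == 0)
    {
        if (l->phase == 0 && l->excl_wait > 0)
            l->phase = 1;          /* shared batch drained: switch */
        else if (l->phase == 1 && l->excl_wait == 0)
        {
            l->phase = 0;          /* exclusive queue drained: switch back */
            l->gate_open = 1;
        }
        pthread_cond_broadcast(&l->cond);
    }
    pthread_mutex_unlock(&l->mutex);
}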
> It's going require some hard thinking to bust that bottleneck. I've sometimes
> thought about maintaining a pre-calculated array of in-progress XIDs in
> shared memory. GetSnapshotData would simply memcpy() that to private memory,
> instead of collecting the xids from ProcArray.
That shifts the contention from reading the data to altering it. But the
data would probably be altered quite a lot less often than it is read, so it
would be a benefit.
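
To make that concrete, here is a rough sketch with invented names, using a
pthread rwlock where the real thing would use ProcArrayLock: transactions
update the dense array when they start and end, and taking a snapshot is
little more than a memcpy() under the shared lock.

#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define MAX_BACKENDS 1024

typedef uint32_t TransactionId;

typedef struct
{
    pthread_rwlock_t lock;
    int              nxids;               /* number of running xids */
    TransactionId    xids[MAX_BACKENDS];  /* kept dense, no holes */
} XidCache;

/* Transaction start: writers contend here, but only once per transaction. */
static void
xidcache_add(XidCache *c, TransactionId xid)
{
    pthread_rwlock_wrlock(&c->lock);
    c->xids[c->nxids++] = xid;
    pthread_rwlock_unlock(&c->lock);
}

/* Commit or abort: remove by overwriting with the last element. */
static void
xidcache_remove(XidCache *c, TransactionId xid)
{
    pthread_rwlock_wrlock(&c->lock);
    for (int i = 0; i < c->nxids; i++)
    {
        if (c->xids[i] == xid)
        {
            c->xids[i] = c->xids[--c->nxids];
            break;
        }
    }
    pthread_rwlock_unlock(&c->lock);
}

/* Snapshot: copy the pre-computed array; no walk over the proc array. */
static int
xidcache_snapshot(XidCache *c, TransactionId *dest)
{
    pthread_rwlock_rdlock(&c->lock);
    int n = c->nxids;
    memcpy(dest, c->xids, n * sizeof(TransactionId));
    pthread_rwlock_unlock(&c->lock);
    return n;
}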
> Or we could try to move some of the if-tests inside the for-loop to
> after the ProcArrayLock is released.
That's always a useful change.
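
Purely as an illustration (made-up names, and assuming no more than 1024
entries), the pattern is to hold the lock only long enough to copy the raw
entries, and to run the per-entry tests on the private copy afterwards:

#include <pthread.h>
#include <stdint.h>
#include <string.h>

static int
collect_filtered(pthread_rwlock_t *lock, const uint32_t *shared, int n,
                 uint32_t *out, uint32_t limit)
{
    uint32_t copy[1024];
    int      nout = 0;

    /* Hold the shared lock only for the copy... */
    pthread_rwlock_rdlock(lock);
    memcpy(copy, shared, n * sizeof(uint32_t));
    pthread_rwlock_unlock(lock);

    /* ...and do the if-tests afterwards, off the critical path. */
    for (int i = 0; i < n; i++)
        if (copy[i] != 0 && copy[i] < limit)
            out[nout++] = copy[i];

    return nout;
}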
On Sat, 14 Mar 2009, Tom Lane wrote:
> Now the fly in the ointment is that there would need to be some way to
> ensure that we didn't write data out to disk until it was valid; in
> particular how do we implement a request to flush WAL up to a particular
> LSN value, when maybe some of the records before that haven't been fully
> transferred into the buffers yet? The best idea I've thought of so far
> is shared/exclusive locks on the individual WAL buffer pages, with the
> rather unusual behavior that writers of the page would take shared lock
> and only the reader (he who has to dump to disk) would take exclusive
> lock. But maybe there's a better way. Currently I don't believe that
> dumping a WAL buffer (WALWriteLock) blocks insertion of new WAL data,
> and it would be nice to preserve that property.
The writers would need to take a shared lock on the page before releasing
the lock that marshals access to the "how long is the log" data. Other
than that, your idea would work.
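
A sketch of that ordering, with everything invented for illustration
(fixed-size pages, records that fit on a single page, and no handling of
buffer reuse):

#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define WAL_PAGE_SIZE 8192
#define N_WAL_PAGES   64

typedef struct
{
    pthread_mutex_t  insert_lock;             /* guards insert_pos */
    uint64_t         insert_pos;              /* next byte of WAL to reserve */
    pthread_rwlock_t page_lock[N_WAL_PAGES];  /* one per buffer page */
    char             pages[N_WAL_PAGES][WAL_PAGE_SIZE];
} WalBuffers;

static void
wal_insert(WalBuffers *wal, const char *rec, int len)
{
    pthread_mutex_lock(&wal->insert_lock);
    uint64_t pos = wal->insert_pos;
    wal->insert_pos += len;
    int page = (int) ((pos / WAL_PAGE_SIZE) % N_WAL_PAGES);

    /* The key ordering: take the page's shared lock BEFORE releasing the
     * insert-position lock, so the flusher can never see space that has
     * been reserved but not yet share-locked. */
    pthread_rwlock_rdlock(&wal->page_lock[page]);
    pthread_mutex_unlock(&wal->insert_lock);

    memcpy(&wal->pages[page][pos % WAL_PAGE_SIZE], rec, len);
    pthread_rwlock_unlock(&wal->page_lock[page]);
}

/* The flusher is the "reader" that takes the exclusive lock, which makes it
 * wait for every in-flight writer on each page it is about to dump. */
static void
wal_flush(WalBuffers *wal, uint64_t from, uint64_t upto)
{
    for (uint64_t p = from / WAL_PAGE_SIZE; p * WAL_PAGE_SIZE < upto; p++)
    {
        int page = (int) (p % N_WAL_PAGES);
        pthread_rwlock_wrlock(&wal->page_lock[page]);
        /* write() the page to disc here; all writers to it have finished. */
        pthread_rwlock_unlock(&wal->page_lock[page]);
    }
}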
An alternative would be to maintain a concurrent linked list of WAL writes
in progress. An entry would be added to the tail each time a new write
starts, marking the current end of the log. When a write finishes, its entry
can be removed from the list very cheaply and with very little contention.
The reader (who dumps the WAL to disc) need only look at the head of the
list to find out how far the log has been completed, because the list is
guaranteed to be in order of position in the log.
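
As a sketch (names invented, and a single mutex standing in for whatever
finer-grained or lock-free list a real implementation would want):

#include <pthread.h>
#include <stdint.h>

typedef struct WalWrite
{
    uint64_t         start;   /* WAL position where this write begins */
    struct WalWrite *prev, *next;
} WalWrite;

typedef struct
{
    pthread_mutex_t lock;
    WalWrite        head;     /* sentinel; head.next is the oldest write */
    uint64_t        insert_pos;
} WalWriteList;

static void
walwritelist_init(WalWriteList *l)
{
    pthread_mutex_init(&l->lock, NULL);
    l->head.prev = l->head.next = &l->head;
    l->insert_pos = 0;
}

/* Reserve space and queue the entry in one step, so the list stays in
 * order of position in the log. */
static uint64_t
walwrite_begin(WalWriteList *l, WalWrite *w, int len)
{
    pthread_mutex_lock(&l->lock);
    w->start = l->insert_pos;
    l->insert_pos += len;
    w->prev = l->head.prev;   /* append at the tail */
    w->next = &l->head;
    l->head.prev->next = w;
    l->head.prev = w;
    pthread_mutex_unlock(&l->lock);
    return w->start;
}

/* Copy finished: unlinking is O(1) wherever the entry sits in the list. */
static void
walwrite_end(WalWriteList *l, WalWrite *w)
{
    pthread_mutex_lock(&l->lock);
    w->prev->next = w->next;
    w->next->prev = w->prev;
    pthread_mutex_unlock(&l->lock);
}

/* How far is the WAL known to be completely copied into the buffers? */
static uint64_t
walwrite_completed_upto(WalWriteList *l)
{
    pthread_mutex_lock(&l->lock);
    uint64_t upto = (l->head.next == &l->head) ? l->insert_pos
                                               : l->head.next->start;
    pthread_mutex_unlock(&l->lock);
    return upto;
}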
The linked list would probably be simpler - the writers don't need to lock
multiple things. It would also have fewer processes accessing each lock, and
therefore maybe less contention. However, it may involve more lock
operations than the one-lock-per-WAL-page method, and I don't know what the
overhead of that would be. (It may be fewer - I don't know what the average
WAL write size is.)
Matthew
--
What goes up must come down. Ask any system administrator.