Re: spinlock contention

From: Florian Pflug <fgp(at)phlo(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: spinlock contention
Date: 2011-06-23 18:34:55
Message-ID: 213CB357-8490-48C2-BDAB-D1381E7E1F52@phlo.org
Lists: pgsql-hackers

On Jun23, 2011, at 17:42 , Robert Haas wrote:
> On Wed, Jun 22, 2011 at 5:43 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>> On Jun12, 2011, at 23:39 , Robert Haas wrote:
>>> So, the majority (60%) of the excess spinning appears to be due to
>>> SInvalReadLock. A good chunk are due to ProcArrayLock (25%).
>>
>> Hm, sizeof(LWLock) is 24 on X86-64, making sizeof(LWLockPadded) 32.
>> However, cache lines are 64 bytes large on recent Intel CPUs AFAIK,
>> so I guess that two adjacent LWLocks currently share one cache line.
>>
>> Currently, the ProcArrayLock has index 4 while SInvalReadLock has
>> index 5, so if I'm not mistaken exactly the two locks where you saw
>> the largest contention on are on the same cache line...
>>
>> Might make sense to try and see if these numbers change if you
>> either make LWLockPadded 64bytes or arrange the locks differently...
>
> I fooled around with this a while back and saw no benefit. It's
> possible a more careful test would turn up something, but I think the
> only real way forward here is going to be to eliminate some of that
> locking altogether.

It seems hard to believe that there isn't some effect of two locks
sharing a cache line. There are architectures (even some of the
Intel architectures, I believe) where cache lines are only 32 bytes,
though - so measuring this certainly isn't easy. Also, I'm by no
means an expert on this, so it might very well be that I'm missing
something fundamental.

> SInvalReadLock is a good example of a lock which seems barely to be
> necessary at all. Except on extraordinarily rare occasions, it is
> only ever taken in share mode.

This is the case we should optimize for, I think. Currently, there's
contention even between two SHARE lockers of a LWLock because our
LWLock implementation uses a spinlock underneath. I've googled around
a bit, and found this:

http://git.kernel.org/?p=linux/kernel/git/torvalds/rwlock.git

It's a userspace rwlock implementation based on atomic instructions
and futexes (for blocking) written by Linus Torvalds as a response
to the following thread:

http://lwn.net/Articles/284741/

It seems to require only a single atomic increment instruction to
acquire a share lock in the uncontested case, and a single compare-
and-exchange instruction to acquire an exclusive lock (again in
the uncontested case).

The code doesn't seem to have a license attached, and seems to have
been written over a weekend and never touched since, so we might
not want to (or be able to) use it directly. But it'd still be
interesting to see if replacing our LWLocks with this implementation
affects throughput.
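Just to illustrate the general technique (this is a rough sketch, not
Linus' actual code - the type and function names are invented, and the
contended path, where you'd block on a futex, is omitted entirely):

```c
#include <stdbool.h>

/* One lock word: it counts shared holders, and a large bias value
 * marks an exclusive holder.  Uses GCC's __sync atomic builtins. */
#define RW_WRITE_BIAS 0x10000000

typedef struct { volatile int word; } rwlock_sketch;

/* Shared acquire, uncontended fast path: a single atomic increment. */
static bool rw_try_shared(rwlock_sketch *l)
{
    int old = __sync_fetch_and_add(&l->word, 1);
    if (old >= RW_WRITE_BIAS)            /* writer present: back out */
    {
        __sync_fetch_and_sub(&l->word, 1);
        return false;                    /* real code would futex-wait */
    }
    return true;
}

static void rw_release_shared(rwlock_sketch *l)
{
    __sync_fetch_and_sub(&l->word, 1);
}

/* Exclusive acquire: a single compare-and-exchange from 0 (no
 * holders at all) to the write bias. */
static bool rw_try_exclusive(rwlock_sketch *l)
{
    return __sync_bool_compare_and_swap(&l->word, 0, RW_WRITE_BIAS);
}

static void rw_release_exclusive(rwlock_sketch *l)
{
    __sync_fetch_and_sub(&l->word, RW_WRITE_BIAS);
}
```

The point being that two share lockers never loop on a spinlock - they
only touch the lock word once each, unlike our LWLocks.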

> Only SICleanupQueue() needs to lock
> out readers, and in the (probably common) case where all backends are
> reasonably close to being caught up, it wouldn't even matter if the
> determination of minMsgNum were a little bit fuzzy. I've been
> straining my brain trying to figure out whether there's some way to
> rewrite this logic so that it can run concurrently with readers; then
> we could get rid of SInvalReadLock altogether. I have a feeling if I
> were 25% smarter I could do it, but I'm not.

This sounds like something which could be done with RCU (Read-Copy-Update)
in a lockless way. The idea is to copy the data structure (or parts)
of it before updating it, and to "commit" the change afterwards by
adjusting a single pointer to point to the new structure (kinda like
we do atomic commits by an atomic update of one bit in the clog).

The hard part is usually to figure out when it's OK to clean up or
reuse the old version of the data structure - for that, you need to
know that all previous readers have completed.

Here's a rough description of how that could work for the invalidation
queue. We'd have two queue structures in shared memory, each containing
a version number. We'd also have a "current" pointer pointing to the
most recent one. Each reader would publish the version number of the
queue he's about to read (the "current" one) in shared memory
(and issue a write barrier). SICleanupQueue() would check whether the
non-current queue structure is unused by comparing its version to the
published versions of all backends (after issuing a read barrier, and
after taking a lock that prevents concurrent SICleanupQueue runs of
course). If there still are users of that version, SICleanupQueue()
out-waits them. Afterwards, it simply copies the current structure
over the old one, does the necessary cleanups, increments the version
number to be larger than the one of the "current" structure,
and *then* updates the "current" pointer.
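A minimal sketch of that scheme (all names invented, the queue payload
and the actual cleanup elided, and the barriers reduced to comments
since this runs single-threaded - a real version needs them at the
marked points):

```c
#include <stdbool.h>
#include <stdint.h>

#define NBACKENDS 4

typedef struct
{
    uint32_t version;           /* never 0; 0 means "not reading" */
    /* ... invalidation queue payload would live here ... */
} QueueStruct;

/* Two queue structures; "current" points to the live one. */
static QueueStruct  queues[2] = {{1}, {2}};
static QueueStruct *current = &queues[0];
static uint32_t published[NBACKENDS];   /* per-backend, in shmem */

/* Reader: publish the version we're about to read. */
static QueueStruct *reader_enter(int backend)
{
    QueueStruct *q = current;
    published[backend] = q->version;    /* write barrier here */
    return q;
}

static void reader_exit(int backend)
{
    published[backend] = 0;
}

/* SICleanupQueue(): succeeds only once no reader still uses the
 * non-current structure; otherwise the caller must out-wait them. */
static bool cleanup_try_swap(void)
{
    QueueStruct *old = (current == &queues[0]) ? &queues[1]
                                               : &queues[0];
    for (int i = 0; i < NBACKENDS; i++) /* read barrier here */
        if (published[i] == old->version)
            return false;               /* still in use */

    *old = *current;                    /* copy, then clean up... */
    old->version = current->version + 1;/* newer than "current" */
    current = old;                      /* the actual "commit" */
    return true;
}
```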

> So my fallback position is to try to find some way to optimize the
> common case where there are no messages to be read. We're already
> allowing readers and writers to run in parallel, so it must not be
> important for a reader to see a message that a writer is in the
> process of writing, but has not yet finished writing. I believe the
> idea here is that sinval messages should be emitted before releasing
> the related locks, so if we've successfully acquired a lock on an
> object, then any pertinent sinval messages must already be completely
> written. What I'm thinking we could do is just keep a global counter
> indicating how many messages have been written, and each backend could
> keep a counter indicating the highest-numbered message it has so far
> read. When a message is written, the global counter is bumped. When
> a backend goes to AcceptInvalidationMessages(), it compares the global
> counter to its own counter, and if they're the same, then it doesn't
> bother taking SInvalReadLock and just exits quickly.

Sounds sensible. I do fear, however, that if we start to micro-optimize
around every single LWLock we're in for quite a maintenance burden in
the future. Even if each single modification is relatively simple, the
complexity still adds up. So I still believe that it'd be better to
start with optimizing LWLocks in general, and to only touch the LWLocks
which are still hot spots afterwards.
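For reference, the fast path you describe would be something like the
following (names invented, the actual locking and barriers elided, and
the slow path reduced to a counter so the control flow is visible):

```c
#include <stdint.h>

static volatile uint32_t global_msg_counter = 1; /* in shmem */
static uint32_t my_msgs_read = 1;                /* per backend */
static int slow_path_calls = 0;                  /* for illustration */

/* Writer side: bump the counter after queueing a message - in the
 * real scheme, just before releasing the LWLock, which provides the
 * write barrier. */
static void sinval_message_written(void)
{
    global_msg_counter++;
}

static void accept_invalidation_messages(void)
{
    /* Fast path: nothing new, skip SInvalReadLock entirely. */
    if (my_msgs_read == global_msg_counter)
        return;

    /* Slow path: take SInvalReadLock, drain the queue, catch up. */
    slow_path_calls++;
    my_msgs_read = global_msg_counter;
}
```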

> There is an obvious (potential) memory-ordering problem here. If it's
> possible for a backend doing AcceptInvalidationMessages() to see a
> stale version of the counter, then it might fail to read messages from
> the queue that it really ought to have seen. Protecting the global
> counter with a spinlock defeats the purpose of doing any of this in
> the first place, because the whole problem is too many people fighting
> over the spinlock protecting SInvalReadLock. It should be sufficient,
> though, for the writing backend to execute a memory-barrier operation
> just after bumping the global counter and the reading backend to
> execute one just before. As it happens, LWLockRelease() is such a
> barrier, since it must acquire and release a spinlock, so we should be
> able to just bump the counter right before releasing the LWLock and
> call it good. On the reading side, the immediately-preceding lock
> acquisition seems like it ought to be sufficient - surely it can't be
> possible to acquire a heavyweight lock on an object without seeing
> memory updates that some other process made before releasing the same
> heavyweight lock.

If we start doing lockless algorithms, I really think we should add
explicit barrier primitives. We could start out with something
simple such as

#define MemoryBarrierWrite \
do { slock_t l; S_UNLOCK(&l); } while (0)
#define MemoryBarrierRead \
do { slock_t l; S_INIT_LOCK(&l); S_LOCK(&l); } while (0)
#define MemoryBarrierReadWrite \
do { slock_t l; S_INIT_LOCK(&l); S_LOCK(&l); S_UNLOCK(&l); } while (0)

But yeah, it seems that in the case of the sinval queue, the surrounding
LWLocks actually provide the necessary barriers.

> A possible objection here is that a 32-bit counter might wrap around,
> giving a false negative, and a read from a 64-bit counter might not be
> atomic. In practice, the chances of a problem here seem remote, but I
> think we can guard against it easily enough. We can put each
> backend's counter of messages read into shared memory, protected by a
> per-backend lock (probably the same LWLock the fast relation lock
> patch already introduces). When a backend compares its counter to the
> global counter, it does so while holding this lock. When
> SICleanupQueue() resets a backend, it grabs the lock and sets that
> backend's counter to a special value (like 0) that we guarantee will
> never match the global counter (which we can cause to wrap from 2^32-1
> to 1). So then AcceptInvalidationMessages() - in the common case
> where there are no invalidation messages to process - will only need
> the per-backend LWLock rather than fighting over SInvalLock.

OTOH, don't all architectures that are interesting targets of
performance tuning provide atomic access to 64-bit values? Even i386
has an 8-byte compare-and-exchange instruction I think...
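I.e., even on a 32-bit platform without a native atomic 64-bit load,
one could read the counter atomically by compare-and-swapping it with
zero against zero: if the value isn't zero the CAS fails, but it still
returns the old value as one atomic unit. A sketch using GCC's __sync
builtins (which, I believe, compile to cmpxchg8b on i386, given a
suitable -march):

```c
#include <stdint.h>

static uint64_t atomic_read_u64(volatile uint64_t *p)
{
    /* If *p == 0 this "succeeds" and rewrites the same 0; otherwise
     * it fails and hands back the current value - atomically either
     * way. */
    return __sync_val_compare_and_swap(p, (uint64_t) 0, (uint64_t) 0);
}
```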

best regards,
Florian Pflug
