From: Florian Pflug <fgp(at)phlo(dot)org>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: spinlock contention
Date: 2011-06-23 21:35:49
Message-ID: FE05C2BD-7FC0-44A4-88BC-A138F8B9E050@phlo.org
Lists: pgsql-hackers
On Jun23, 2011, at 22:15 , Robert Haas wrote:
> On Thu, Jun 23, 2011 at 2:34 PM, Florian Pflug <fgp(at)phlo(dot)org> wrote:
>> It seems hard to believe that there isn't some effect of two locks
>> sharing a cache line. There are architectures (even some of the
>> Intel architectures, I believe) where cache lines are 32 bytes, though -
>> so measuring this certainly isn't easy. Also, I'm absolutely no
>> expert on this, so it might very well be that I'm missing something
>> fundamental.
>
> Well, I'm sure there is some effect, but my experiments seem to
> indicate that it's not a very important one. Again, please feel free
> to provide contrary evidence. I think the basic issue is that - in
> the best possible case - padding the LWLocks so that you don't have
> two locks sharing a cache line can reduce contention on the busier
> lock by at most 2x. (The less busy lock may get a larger reduction
> but that may not help you much.) If what you really need is for
> the contention to decrease by 1000x, you're just not really moving the
> needle.
Agreed. OTOH, adding a few dummy entries to the LWLocks array to separate
the most heavily contended LWLocks from the others might still be
worthwhile.
> That's why the basic fast-relation-lock patch helps so much:
> it replaces a system where every lock request results in contention
> with a system where NONE of them do.
>
> I tried rewriting the LWLocks using CAS. It actually seems to make
> things slightly worse on the tests I've done so far, perhaps because I
> didn't make it respect spins_per_delay. Perhaps fetch-and-add would
> be better, but I'm not holding my breath. Everything I'm seeing so
> far leads me to the belief that we need to get rid of the contention
> altogether, not just contend more quickly.
Is there a patch available? How did you do the slow path (i.e. the
case where there's contention and you need to block)? It seems to
me that without some kernel support like futexes it's impossible
to do better than LWLocks already do, because any simpler scheme
like
    while (atomic_inc_post(lock) > 0) {
        atomic_dec(lock);   /* undo our increment ... */
        block(lock);        /* ... but the lock may already be free by now */
    }
for the shared-locker case suffers from a race condition (the lock
might be released before you actually block()).
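For comparison, that's exactly the part a futex gets right: FUTEX_WAIT
re-checks the lock word inside the kernel, atomically with the decision
to sleep, so a release that sneaks in between the user-space check and
the syscall can't be missed. A Linux-only sketch (the lock-word encoding
is deliberately left open here):

    #include <stdint.h>
    #include <linux/futex.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /*
     * Sleeps only if *addr still equals 'expected' when the kernel
     * looks at it; otherwise returns immediately with EAGAIN. That
     * closes precisely the window between atomic_dec() and block()
     * in the loop above.
     */
    static long
    futex_wait(volatile uint32_t *addr, uint32_t expected)
    {
        return syscall(SYS_futex, addr, FUTEX_WAIT, expected,
                       NULL, NULL, 0);
    }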
>>> There is an obvious (potential) memory-ordering problem here. If it's
>>> possible for a backend doing AcceptInvalidationMessages() to see a
>>> stale version of the counter, then it might fail to read messages from
>>> the queue that it really ought to have seen. Protecting the global
>>> counter with a spinlock defeats the purpose of doing any of this in
>>> the first place, because the whole problem is too many people fighting
>>> over the spinlock protecting SInvalReadLock. It should be sufficient,
>>> though, for the writing backend to execute a memory-barrier operation
>>> just after bumping the global counter and the reading backend to
>>> execute one just before. As it happens, LWLockRelease() is such a
>>> barrier, since it must acquire and release a spinlock, so we should be
>>> able to just bump the counter right before releasing the LWLock and
>>> call it good. On the reading side, the immediately-preceding lock
>>> acquisition seems like it ought to be sufficient - surely it can't be
>>> possible to acquire a heavyweight lock on an object without seeing
>>> memory updates that some other process made before releasing the same
>>> heavyweight lock.
>>
>> If we start doing lockless algorithms, I really think we should add
>> explicit barrier primitives. We could start out with something
>> simple such as
>>
>> #define MemoryBarrierWrite \
>>     do { slock_t l; S_UNLOCK(&l); } while (0)
>> #define MemoryBarrierRead \
>>     do { slock_t l; S_INIT_LOCK(&l); S_LOCK(&l); } while (0)
>> #define MemoryBarrierReadWrite \
>>     do { slock_t l; S_INIT_LOCK(&l); S_LOCK(&l); S_UNLOCK(&l); } while (0)
>>
>> But yeah, it seems that in the case of the sinval queue, the surrounding
>> LWLocks actually provide the necessary barriers.
>
> Yes, a set of general memory barrier primitives would be handy.
> Unfortunately, it would probably be quite a bit of work to develop
> this and verify that it works properly on all supported platforms.
The idea would be to start out with something trivial like the above,
maybe with an #if for compilers that provide something like GCC's
__sync_synchronize(). We could then gradually add implementations
for specific architectures, hopefully contributed by people who
actually own the hardware and can test them.
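Roughly something like this, I imagine (untested, and the GCC 4.1 cutoff
for __sync_synchronize() is an assumption; the fallback just reuses the
spinlock trick quoted above):

    /* hypothetical first cut at a barrier.h */
    #if defined(__GNUC__) && \
        (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 1))
    /* Compiler provides a full fence intrinsic - use it for all three. */
    #define MemoryBarrierWrite      __sync_synchronize()
    #define MemoryBarrierRead       __sync_synchronize()
    #define MemoryBarrierReadWrite  __sync_synchronize()
    #else
    /* Fallback: (ab)use a private spinlock, as sketched above. */
    #include "storage/s_lock.h"

    #define MemoryBarrierReadWrite \
        do { slock_t l; S_INIT_LOCK(&l); S_LOCK(&l); S_UNLOCK(&l); } while (0)
    #define MemoryBarrierWrite      MemoryBarrierReadWrite
    #define MemoryBarrierRead       MemoryBarrierReadWrite
    #endif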
best regards,
Florian Pflug