From: Simon Riggs <simon(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org, Michael Paesold <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-10-12 17:44:50
Message-ID: 1129139090.8300.568.camel@localhost.localdomain
Lists: pgsql-hackers
On Wed, 2005-09-14 at 13:32 -0400, Tom Lane wrote:
> I wrote:
> > Another thought came to mind: maybe the current data layout for LWLocks
> > is bad. Right now, the spinlock that protects each LWLock data struct
> > is itself part of the struct, and since the structs aren't large (circa
> > 20 bytes), the whole thing is usually all in the same cache line. ...
> > Maybe it'd be better to allocate the spinlocks off by themselves.
>
> Well, this is odd. I made up a patch to do this (attached) and found
> that it pretty much sucks. Still the 4-way Opteron, previous best
> (slock-no-cmpb and spin-delay-2):
> 1 31s 2 42s 4 51s 8 100s
> with lwlock-separate added:
> 1 31s 2 52s 4 106s 8 213s
>
> What I had expected to see was a penalty in the single-thread case (due
> to instructions added to the LWLock functions), but there isn't any.
> I did not expect to see a factor-of-2 penalty for four threads.
>
> I guess what this means is that there's no real problem with losing the
> cache line while manipulating the LWLock, which is what the patch was
> intended to prevent. Instead, we're paying for swapping two cache lines
> (the spinlock and the LWLock) across processors instead of just one line.
> But that should at worst be a 2x inflation of the time previously spent
> in LWLockAcquire/Release, which is surely not yet all of the application
> ;-). Why the heck is this so bad?
(Just going back through the whole thread for completeness.)
> Should we expect that apparently
> minor changes in shared data structures might be costing equivalently
> huge penalties in SMP performance elsewhere?
Yes. That's the advice from Intel and AMD; but we should add that equally minor changes also have the potential to improve performance.
The problem we were trying to avoid here was false sharing of a cache line: requestors of different locks contending with one another simply because those locks' spinlocks were adjacent in memory.
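To make that concrete, here is a minimal sketch of the effect and of the
usual cure (illustrative only, not PostgreSQL source; the slock_t
definition and the 128-byte line size are assumptions for the example):

/*
 * Two spinlocks adjacent in memory land in the same cache line, so
 * backends spinning on lock_a keep invalidating the line under backends
 * spinning on lock_b, even though the two locks are logically unrelated.
 */
typedef volatile unsigned char slock_t;    /* byte TAS lock, as on x86 */

struct AdjacentLocks
{
    slock_t lock_a;    /* offset 0 */
    slock_t lock_b;    /* offset 1: same cache line -> false sharing */
};

/* One cure (and the one proposed below): pad each lock out to a full
 * cache line of its own. */
#define CACHE_LINE_SIZE 128    /* assumed; varies by CPU */

struct PaddedLocks
{
    slock_t lock_a;
    char    pad_a[CACHE_LINE_SIZE - sizeof(slock_t)];
    slock_t lock_b;
    char    pad_b[CACHE_LINE_SIZE - sizeof(slock_t)];
};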
Splitting the data structure was just one possible way of avoiding that. The results above show that this particular solution didn't work, but there could be others.
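For reference, the two layouts look roughly like this (a simplified
sketch, not the actual patch; field and variable names are approximate):

typedef volatile unsigned char slock_t;    /* same definition as the sketch above */
typedef struct PGPROC PGPROC;              /* wait-queue entries; opaque here */

/* Current layout: the spinlock is embedded in the LWLock struct, so an
 * acquire/release touches a single cache line. */
typedef struct LWLockEmbedded
{
    slock_t  mutex;         /* protects the fields below */
    char     exclusive;     /* # of exclusive holders (0 or 1) */
    int      shared;        /* # of shared holders */
    PGPROC  *head;          /* head of the wait queue */
} LWLockEmbedded;

/* Patch layout: the spinlocks live in their own array. Every
 * acquire/release now drags two cache lines across processors (the
 * spinlock's and the lock data's) instead of one, which fits the
 * slowdown reported above. */
typedef struct LWLockData
{
    char     exclusive;
    int      shared;
    PGPROC  *head;
} LWLockData;

static slock_t    *LWLockMutexes;   /* spinlocks, allocated separately */
static LWLockData *LWLocks;         /* lock data, sans mutex */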
One thing we tried in February was padding out the statically defined locks with dummy lock definitions in the enum. This has the effect of ensuring that the most contended locks are very definitely in their own cache line and not shared with others. That showed a noticeable improvement in performance, and it costs very little to implement, even if the code would require some explanatory documentation.
The LWLockId enum in include/storage/lwlock.h looks like this:
typedef enum LWLockId
{
    BufMappingLock,
    BufFreelistLock,
    LockMgrLock,
    OidGenLock,
    XidGenLock,
    ProcArrayLock,
    SInvalLock,
    FreeSpaceLock,
    WALInsertLock,
    WALWriteLock,
    ...
Notice that the heavily contended locks (i.e. the first three and the WAL locks) are adjacent to at least one other heavily contended lock, so they are certain to share a cache line and therefore to cause false sharing. That holds on all CPU types, not just Intel and AMD (ref: the manufacturers' optimization handbooks).
This could be replaced by...
typedef enum LWLockId
{
    BufMappingLock,
    BufMappingLock_Padding1,
    BufMappingLock_Padding2,
    BufMappingLock_Padding3,
    BufMappingLock_Padding4,
    BufMappingLock_Padding5,
    BufMappingLock_Padding6,
    BufMappingLock_Padding7,
    BufFreelistLock,
    BufFreelistLock_Padding1,
    BufFreelistLock_Padding2,
    BufFreelistLock_Padding3,
    BufFreelistLock_Padding4,
    BufFreelistLock_Padding5,
    BufFreelistLock_Padding6,
    BufFreelistLock_Padding7,
    LockMgrLock,
    LockMgrLock_Padding1,
    LockMgrLock_Padding2,
    LockMgrLock_Padding3,
    LockMgrLock_Padding4,
    LockMgrLock_Padding5,
    LockMgrLock_Padding6,
    LockMgrLock_Padding7,
    OidGenLock,
    XidGenLock,
    ProcArrayLock,
    SInvalLock,
    FreeSpaceLock,
    WALInsertLock,
    WALInsertLock_Padding1,
    WALInsertLock_Padding2,
    WALInsertLock_Padding3,
    WALInsertLock_Padding4,
    WALInsertLock_Padding5,
    WALInsertLock_Padding6,
    WALInsertLock_Padding7,
    WALWriteLock,
    WALWriteLock_Padding1,
    WALWriteLock_Padding2,
    WALWriteLock_Padding3,
    WALWriteLock_Padding4,
    WALWriteLock_Padding5,
    WALWriteLock_Padding6,
    WALWriteLock_Padding7,
    ...
where the number of padding locks is determined by how many lock structures fit within a 128-byte cache line.
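As a sketch, that padding count could be derived instead of hard-coded,
along these lines (the names and sizes here are assumptions for
illustration, not taken from the actual patch):

/*
 * Each LWLockId indexes one LWLock struct in a contiguous array, so a
 * hot lock needs enough dummy neighbours after it to fill out the rest
 * of its cache line.
 */
#define CACHE_LINE_SIZE     128  /* assumed worst case across target CPUs */
#define LWLOCK_STRUCT_SIZE  16   /* assumed aligned sizeof(LWLock); chosen
                                  * here to match the 7 padding entries
                                  * shown above */

/* 128 / 16 = 8 structs per line, hence 7 padding entries per hot lock. */
#define LWLOCK_PADDING_ENTRIES ((CACHE_LINE_SIZE / LWLOCK_STRUCT_SIZE) - 1)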
This isn't exactly elegant coding, but it provided a useful improvement on an 8-way SMP box when run against the 8.0 code base. OK, let's be brutal: this looks pretty darn stupid. But it does follow the CPU optimization handbook advice, and I did see a noticeable improvement in performance and a reduction in context switching.
I'm not in a position to try this again now on 8.1beta, but I'd welcome a performance test result from anybody who is. I'll supply a patch against 8.1beta for anyone wanting to test this.
Best Regards,
Simon Riggs