| From: | Bruce Momjian <pgman(at)candle(dot)pha(dot)pa(dot)us> | 
|---|---|
| To: | Simon Riggs <simon(at)2ndquadrant(dot)com> | 
| Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org, Michael Paesold <mpaesold(at)gmx(dot)at> | 
| Subject: | Re: Spinlocks, yet again: analysis and proposed patches | 
| Date: | 2006-06-14 21:47:13 | 
| Message-ID: | 200606142147.k5ELlDl28970@candle.pha.pa.us | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-hackers | 
Added to TODO list.
---------------------------------------------------------------------------
Simon Riggs wrote:
> On Wed, 2005-09-14 at 13:32 -0400, Tom Lane wrote:
> > I wrote:
> > > Another thought came to mind: maybe the current data layout for LWLocks
> > > is bad.  Right now, the spinlock that protects each LWLock data struct
> > > is itself part of the struct, and since the structs aren't large (circa
> > > 20 bytes), the whole thing is usually all in the same cache line. ...
> > > Maybe it'd be better to allocate the spinlocks off by themselves.
> > 
> > Well, this is odd.  I made up a patch to do this (attached) and found
> > that it pretty much sucks.  Still the 4-way Opteron, previous best
> > (slock-no-cmpb and spin-delay-2):
> > 	1 31s	2 42s	4 51s	8 100s
> > with lwlock-separate added:
> > 	1 31s	2 52s	4 106s	8 213s
> > 
> > What I had expected to see was a penalty in the single-thread case (due
> > to instructions added to the LWLock functions), but there isn't any.
> > I did not expect to see a factor-of-2 penalty for four threads.
> > 
> > I guess what this means is that there's no real problem with losing the
> > cache line while manipulating the LWLock, which is what the patch was
> > intended to prevent.  Instead, we're paying for swapping two cache lines
> > (the spinlock and the LWLock) across processors instead of just one line.
> > But that should at worst be a 2x inflation of the time previously spent
> > in LWLockAcquire/Release, which is surely not yet all of the application
> > ;-).  Why the heck is this so bad?  
> 
> (Just going back through the whole thread for completeness.)
> 
> > Should we expect that apparently
> > minor changes in shared data structures might be costing equivalently
> > huge penalties in SMP performance elsewhere?
> 
> Yes. That's the advice from Intel and AMD; but we should add that there
> is potential for improving performance also.
> 
> The possible problem we were trying to avoid here was false sharing of
> the cache line by lock requestors of locks whose spin locks were
> adjacent in memory.
> 
> Splitting the data structure was just one of the possible ways of
> avoiding that. The above shows that the possible solution described
> above didn't work, but there could be others.
> 
> One thing we tried in February was padding out the statically defined
> locks with dummy lock definitions in the enum. This has the effect of
> ensuring that the most contested locks are very definitely in their own
> cache line and not shared with others.
> That showed a noticeable improvement in performance, probably because it
> costs very little to implement, even if the code would require some
> explanatory documentation. 
> 
> The lwlock structure in include/storage/lwlock.h looks like
> 
> typedef enum LWLockId
> {
> 	BufMappingLock,
> 	BufFreelistLock,
> 	LockMgrLock,
> 	OidGenLock,
> 	XidGenLock,
> 	ProcArrayLock,
> 	SInvalLock,
> 	FreeSpaceLock,
> 	WALInsertLock,
> 	WALWriteLock,
> 	...
> 
> Notice that the heavily contested locks (i.e. first 3 and the WAL locks)
> are adjacent to at least one other heavily contested lock. So they are
> certain to be in the same cache line and therefore to cause false
> sharing (on all CPU types, not just Intel and AMD (ref: Manufacturer's
> optimization handbooks).
> 
> This could be replaced by...
> 
> typedef enum LWLockId
> {
> 	BufMappingLock,
> 	BufMappingLock_Padding1,
> 	BufMappingLock_Padding2,
> 	BufMappingLock_Padding3,
> 	BufMappingLock_Padding4,
> 	BufMappingLock_Padding5,
> 	BufMappingLock_Padding6,
> 	BufMappingLock_Padding7,
> 	BufFreelistLock,
> 	BufFreelistLock_Padding1,
> 	BufFreelistLock_Padding2,
> 	BufFreelistLock_Padding3,
> 	BufFreelistLock_Padding4,
> 	BufFreelistLock_Padding5,
> 	BufFreelistLock_Padding6,
> 	BufFreelistLock_Padding7,
> 	LockMgrLock,
> 	LockMgrLock_Padding1,
> 	LockMgrLock_Padding2,
> 	LockMgrLock_Padding3,
> 	LockMgrLock_Padding4,
> 	LockMgrLock_Padding5,
> 	LockMgrLock_Padding6,
> 	LockMgrLock_Padding7,
> 	OidGenLock,
> 	XidGenLock,
> 	ProcArrayLock,
> 	SInvalLock,
> 	FreeSpaceLock,
> 	WALInsertLock,
> 	WALInsertLock_Padding1,
> 	WALInsertLock_Padding2,
> 	WALInsertLock_Padding3,
> 	WALInsertLock_Padding4,
> 	WALInsertLock_Padding5,
> 	WALInsertLock_Padding6,
> 	WALInsertLock_Padding7,
> 	WALWriteLock,
> 	WALWriteLock_Padding1,
> 	WALWriteLock_Padding2,
> 	WALWriteLock_Padding3,
> 	WALWriteLock_Padding4,
> 	WALWriteLock_Padding5,
> 	WALWriteLock_Padding6,
> 	WALWriteLock_Padding7,
> 	...
> 
> where the number of padding locks is determined by how many lock
> structures fit within a 128 byte cache line.
> 
> This isn't exactly elegant coding, but it provides a useful improvement
> on an 8-way SMP box when run on 8.0 base. OK, lets be brutal: this looks
> pretty darn stupid. But it does follow the CPU optimization handbook
> advice and I did see a noticeable improvement in performance and a
> reduction in context switching.
> 
> I'm not in a position to try this again now on 8.1beta, but I'd welcome
> a performance test result from anybody that is. I'll supply a patch
> against 8.1beta for anyone wanting to test this.
> 
> Best Regards, Simon Riggs
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 3: Have you checked our extensive FAQ?
> 
>                http://www.postgresql.org/docs/faq
> 
-- 
  Bruce Momjian   http://candle.pha.pa.us
  EnterpriseDB    http://www.enterprisedb.com
+ If your life is a hard drive, Christ can be your backup. +
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Bruce Momjian | 2006-06-14 21:54:56 | Re: libpq's pollution of application namespace | 
| Previous Message | Luke Lonergan | 2006-06-14 21:45:07 | Re: Multi-byte and client side character encoding |