Re: Spinlocks, yet again: analysis and proposed patches

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Marko Kreen <marko(at)l-t(dot)ee>, pgsql-hackers(at)postgresql(dot)org, "Michael Paesold" <mpaesold(at)gmx(dot)at>
Subject: Re: Spinlocks, yet again: analysis and proposed patches
Date: 2005-09-15 21:49:03
Message-ID: 16824.1126820943@sss.pgh.pa.us
Lists: pgsql-hackers

I wrote:
> I guess what this means is that there's no real problem with losing the
> cache line while manipulating the LWLock, which is what the patch was
> intended to prevent. Instead, we're paying for swapping two cache lines
> (the spinlock and the LWLock) across processors instead of just one line.
> But that should at worst be a 2x inflation of the time previously spent
> in LWLockAcquire/Release, which is surely not yet all of the application
> ;-). Why the heck is this so bad? Should we expect that apparently
> minor changes in shared data structures might be costing equivalently
> huge penalties in SMP performance elsewhere?

I did some oprofile work and found that the cost seems to be because
(1) there's an increase in spinlock contention (more time spent in
s_lock), and (2) there's more time spent in LWLockAcquire/Release.
I'm not entirely clear about the reason for (1), but (2) apparently
is because of the extra cache line swapping, as posited above. In
the oprofile trace it's clear that the extra cost comes exactly in the
statements that touch fields of the shared LWLock, which are the places
where you might have to wait to acquire a cache line.

I thought for a bit that the problem might come from having chosen to
put a pointer to the spinlock into each LWLock; fetching that pointer
implies an additional access to the contended cache line. However,
changing the data structure to a simple linear array of spinlocks
didn't help. So that whole idea seems to have died a painful death.
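
For concreteness, the two layouts I compared look roughly like this (just a
sketch; the names are invented and this isn't the actual lwlock.c code):

#include <stdbool.h>

typedef unsigned char slock_t;      /* assumption: one-byte TAS spinlock */

/* Layout A: each LWLock carries a pointer to a spinlock stored elsewhere.
 * Just fetching lw->mutex touches the contended LWLock cache line before
 * the spinlock itself can even be tried. */
typedef struct LWLockWithPtr
{
    slock_t    *mutex;              /* points into a separate spinlock area */
    bool        releaseOK;
    char        exclusive;
    int         shared;
    /* ... wait-queue head/tail links ... */
} LWLockWithPtr;

/* Layout B: a simple linear array of spinlocks, indexed by lock id, so
 * acquiring the spinlock never touches the LWLock's own cache line.
 * (As noted above, this didn't help either.) */
extern slock_t LWLockSpinlocks[];   /* hypothetical: one slock_t per LWLock */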

One other interesting result is that the data structure change neither
helped nor hurt on an EM64T machine with 2 physical (4 logical)
processors. This is also x86_64, but evidently the caching behavior
is totally different from Opterons.

One thing that did seem to help a little bit was padding the LWLocks
to 32 bytes (by default they are 24 bytes each on x86_64) and ensuring
the array starts on a 32-byte boundary. This ensures that we won't have
any LWLocks crossing cache lines --- contended access to such an LWLock
would probably incur the sort of large penalty seen above, because you'd
be trading two cache lines back and forth, not just one. It seems that the
important locks are not split that way in CVS tip, because the gain
wasn't much, but I wonder whether some effect like this might explain
some of the unexplainable performance changes we've noticed in the past
(eg, in the dbt2 results). A seemingly unrelated small change in the
size of other data structures in shared memory might move things around
enough to make a performance-critical lock cross a cache line boundary.

On regular x86, the LWLock struct is by chance exactly 16 bytes, so
there's no alignment issue. But 64-bit platforms and platforms where
slock_t is bigger than char are exposed to this risk.
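
In case it helps to see it concretely, the change amounts to something like
this (just a sketch with invented names, not the exact patch I'm about to
commit):

#include <stdint.h>

typedef unsigned char slock_t;      /* assumption: one-byte TAS spinlock */

typedef struct LWLock               /* minimal stand-in, not the real struct */
{
    slock_t     mutex;
    char        exclusive;
    int         shared;
    void       *head;               /* wait-queue links */
    void       *tail;
} LWLock;

#define LWLOCK_PADDED_SIZE  32      /* smallest power of 2 >= sizeof(LWLock) */

typedef union LWLockPadded
{
    LWLock      lock;                       /* ~24 bytes on x86_64 */
    char        pad[LWLOCK_PADDED_SIZE];    /* forces a 32-byte stride */
} LWLockPadded;

/* Round the raw shared-memory pointer up to a 32-byte boundary so that
 * element 0 starts aligned; with the padded stride, no LWLock can then
 * straddle a cache-line boundary (assuming lines are a multiple of 32). */
static LWLockPadded *
align_lwlock_array(void *rawptr)
{
    uintptr_t   p = (uintptr_t) rawptr;

    p = (p + LWLOCK_PADDED_SIZE - 1) & ~((uintptr_t) (LWLOCK_PADDED_SIZE - 1));
    return (LWLockPadded *) p;
}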

I'm going to go ahead and make that change, since it doesn't seem likely
to have any downside. It might be interesting to look into forcing
proper alignment of the shared buffer headers as well.

regards, tom lane
