From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Subject: Re: StrategyGetBuffer optimization, take 2
Date: 2013-08-19 13:30:09
Message-ID: CAHyXU0yNtTUWC6Y626urDLvXY-n6DbeYY4K2mJz-3L-=2CearQ@mail.gmail.com
Lists: pgsql-hackers
On Sat, Aug 17, 2013 at 10:55 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Mon, Aug 5, 2013 at 11:49 AM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> *) What I think is happening:
>> I think we are again getting burned by getting de-scheduled while
>> holding the free list lock. I've been chasing this problem for a long
>> time now (for example, see:
>> http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html)
>> but now I've got a reproducible case. What is happening is this:
>>
>> 1. in RelationGetBufferForTuple (hio.c): fire LockRelationForExtension
>> 2. call ReadBufferBI. this goes down the chain until StrategyGetBuffer()
>> 3. Lock free list, go into clock sweep loop
>> 4. while still holding the free list lock in the clock sweep, hit a
>> 'hot' buffer, spin on it
>> 5. get de-scheduled
>> 6. now enter the 'hot buffer spin lock lottery'
>> 7. more and more backends pile on, the linux scheduler goes berserk,
>> reducing chances of winning #6
>> 8. finally win the lottery. lock released. everything back to normal.
>
> This is an interesting theory, but where's the evidence? I've seen
> spinlock contention come from enough different places to be wary of
> arguments that start with "it must be happening because...".
Absolutely. My evidence is circumstantial at best -- let's call it a
hunch. I also do not think we are facing pure spinlock contention,
but something more complex: a combination of spinlocks, the
free list lwlock, and the linux scheduler. This problem showed up
going from RHEL 5 to 6, which brought a lot of scheduler changes. A lot
of other things changed too, but the high sys cpu really suggests we
are getting some feedback from the scheduler.
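To make that concrete, here is a rough schematic of the locking pattern
I'm describing (paraphrased from the 9.2/9.3-era StrategyGetBuffer(),
not the literal source): the free list lwlock is held across the whole
clock sweep, and each candidate buffer's header spinlock is taken
inside it, so a backend that loses its timeslice at step 4/5 parks
every other allocating backend behind BufFreelistLock:

    /* schematic of the current behavior, simplified */
    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);   /* step 3: lock free list */
    for (;;)
    {
        volatile BufferDesc *buf =
            &BufferDescriptors[StrategyControl->nextVictimBuffer];

        if (++StrategyControl->nextVictimBuffer >= NBuffers)
            StrategyControl->nextVictimBuffer = 0;

        LockBufHdr(buf);            /* step 4: spin on a possibly 'hot' header */
        if (buf->refcount == 0 && buf->usage_count == 0)
            break;                  /* victim found; caller releases the lwlock */
        if (buf->refcount == 0 && buf->usage_count > 0)
            buf->usage_count--;
        UnlockBufHdr(buf);
        /* a de-schedule anywhere in here (step 5) stalls all other allocators */
    }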
> IMHO, the thing to do here is run perf record -g during one of the
> trouble periods. The performance impact is quite low. You could
> probably even set up a script that runs perf for five minute intervals
> at a time and saves all of the perf.data files. When one of these
> spikes happens, grab the one that's relevant.
Unfortunately -- that's not on the table. Dropping shared_buffers to
2GB (thanks RhodiumToad) seems to have fixed the issue, and there is
zero chance I will get approval to revert that setting in order to
force this to re-appear. So far, I have not been able to reproduce it
in testing. By the way, this problem has popped up in other places
too, and the typical remedies are applied until it goes away :(.
> If you see that s_lock is where all the time is going, then you've
> proved it's a PostgreSQL spinlock rather than something in the kernel
> or a shared library. If you can further see what's calling s_lock
> (which should hopefully be possible with perf -g), then you've got it
> nailed dead to rights.
Well, I don't think it's that simple. So my plan of action is this:
1) Improvise a patch that removes one *possible* trigger for the
problem, or at least makes it much less likely to occur. Also, in
real-world cases where usage_count is examined N times before
returning a candidate buffer, the amount of overall spinlocking from
buffer allocation is reduced by approximately (N-1)/N. Even though
spinlocking is cheap, it's hard to argue with that... (a sketch of
what I mean follows after this list).
2) Exhaustively performance test patch #1. I think this is win-win
since the StrategyGetBuffer() clock sweep loop is, quite frankly,
relatively un-optimized. I don't see how reducing the amount of
locking could hurt performance, but I've been, uh, wrong about these
types of things before.
3) If a general benefit without downside is shown from #2, I'll simply
advance the patch for the next CF and see how things shake out. If and
when I feel like there's a decent shot at getting it accepted, I may
go through the motions of setting up a patched server in production
and attempting to raise shared_buffers again. But that's a long way
off.
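Roughly, the shape of what I have in mind for #1 -- a sketch only, not
the patch itself: examine and decrement usage_count without taking the
buffer header spinlock, and only take the spinlock to recheck the one
buffer that looks like a victim. If the sweep touches N buffers before
finding one, that drops the header spinlock acquisitions from N to
about 1, which is where the (N-1)/N figure above comes from (the "no
unpinned buffers" safety valve is omitted here for brevity):

    for (;;)
    {
        volatile BufferDesc *buf =
            &BufferDescriptors[StrategyControl->nextVictimBuffer];

        if (++StrategyControl->nextVictimBuffer >= NBuffers)
            StrategyControl->nextVictimBuffer = 0;

        if (buf->refcount == 0 && buf->usage_count == 0)
        {
            LockBufHdr(buf);        /* recheck under the header spinlock */
            if (buf->refcount == 0 && buf->usage_count == 0)
                return buf;         /* victim found, header lock still held */
            UnlockBufHdr(buf);      /* lost the race, keep sweeping */
        }
        else if (buf->usage_count > 0)
            buf->usage_count--;     /* unlocked decrement; a missed or extra
                                     * decrement only perturbs the sweep a bit */
    }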
merlin