From: Jim Nasby <jim(at)nasby(dot)net>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BufFreelistLock
Date: 2010-12-15 23:02:24
Message-ID: 4C7C05F7-9360-4709-99EA-57E7B58199AB@nasby.net
Lists: pgsql-hackers
On Dec 15, 2010, at 2:40 PM, Jeff Janes wrote:
> On Tue, Dec 14, 2010 at 1:42 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>>
>> On Dec 14, 2010, at 11:08 AM, Jeff Janes wrote:
>>> I wouldn't expect an increase in shared_buffers to make contention on
>>> BufFreelistLock worse. If the increased buffers are used to hold
>>> heavily-accessed data, then you will find the pages you want in
>>> shared_buffers more often, and so need to run the clock-sweep less
>>> often. That should make up for longer sweeps. But if the increased
>>> buffers are used to hold data that is just read once and thrown away,
>>> then the clock sweep shouldn't need to sweep very far before finding a
>>> candidate.
>>
>> Well, we're talking about a working set that's between 96 and 192G, but
>> only 8G (or 28G) of shared buffers. So there's going to be a pretty
>> large amount of buffer replacement happening. We also have
>> 210 tables where the ratio of heap buffer hits to heap reads is
>> over 1000, so the stuff that is in shared buffers probably keeps
>> usage_count quite high. Put these two together, and we're probably
>> spending a fairly significant amount of time running the clock sweep.
>
> The thing that makes me think the bottleneck is elsewhere is that
> increasing from 8G to 28G made it worse. If buffer unpins are
> happening at about the same rate, then my gut feeling is that the
> clock sweep has to do about the same amount of decrementing before it
> gets to a free buffer under steady state conditions. Whether it has
> to decrement 8G of buffers three and a half times each, or 28G of
> buffers one time each, it would do about the same amount of work.
> This is all hand waving, of course.
While we're waving hands... I think the issue is that our working set is massive, which means there's a lot of activity driving usage_count up on buffers. Increasing shared buffers would reduce that effect once they hold a large enough fraction of the working set, but I suspect that going from 8G to 28G wouldn't have made much difference there. The net effect is that we now have *more* buffers with a high usage count that the sweep has to slog through.
Anyway, once I'm able to get the buffer stats contrib module installed we'll have a better idea of what's actually happening.
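For example, assuming pg_buffercache is installed, something along these lines should show how usage counts are distributed across shared_buffers; a large fraction of buffers sitting at the maximum usage_count of 5 would mean the sweep has a lot of decrementing to do before it finds a victim:

    -- usage_count distribution across shared_buffers; usagecount is NULL
    -- for buffers that have never held a page
    SELECT usagecount, count(*) AS buffers
      FROM pg_buffercache
     GROUP BY usagecount
     ORDER BY usagecount;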
>> Even excluding our admittedly unusual workload, there is still significant overhead in running the clock sweep vs just grabbing something off of the free list (assuming we had separate locks for the two operations).
>
> But do we actually know that? Doing a clock sweep is only a lot of
> overhead if it has to pass over many buffers in order to find a good
> one, and we don't know the numbers on that. I think you can sweep a
> lot of buffers for the overhead of a single contended lock.
>
> If the sweep and the freelist had separate locks, you still need to
> lock the freelist to add to it things discovered during the sweep.
I'm hoping we could actually use separate locks for adding and removing, assuming we discover this is actually a consideration.
>> Does anyone know what the overhead of getting a block from the filesystem cache is?
>
> I did tests on this a few days ago. It took on average 20
> microseconds per row to select one row via primary key when everything
> was in shared buffers.
> When everything was in RAM but not shared buffers, it took 40
> microseconds. Of this, about 10 microseconds were the kernel calls to
> seek and read from OS cache to shared_buffers, and the other 10
> microseconds is some kind of PG overhead, I don't know where. The
> timings are per select, not per page, and one select usually reads two
> pages, one for the index leaf and one for the table.
>
> This was all single-client usage on 2.8GHz AMD Opteron. Not all the
> components of the timings will scale equally with additional clients
> on additional CPUs of course. I think the time spent in the kernel
> calls to do the seek and read will scale better than most other parts.
Interesting info. I wonder if that 10us of unknown overhead is related to shared buffer management. Do you know whether you still had room in shared buffers when you ran that test? It would be interesting to compare three cases: buffers available on the free list; no buffers on the free list but some at usage_count 0 (though I'm not sure how you could set that up); and shared buffers full of high-usage_count pages.
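pg_buffercache can at least approximate those cases, though it doesn't show whether a buffer is actually on the free list, so something like this is only a rough sketch:

    -- rough breakdown of buffer states; a buffer that has never held a
    -- page shows up with relfilenode IS NULL
    SELECT CASE WHEN relfilenode IS NULL THEN 'never used'
                WHEN usagecount = 0      THEN 'usage_count 0'
                ELSE                          'usage_count >= 1'
           END AS state,
           count(*) AS buffers
      FROM pg_buffercache
     GROUP BY 1
     ORDER BY 1;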
>> BTW, given our workload I can't see any way of running at debug2 without having a large impact on performance.
>
> As long as you are adding #define BGW_DEBUG and recompiling, you might
> as well promote all the DEBUG2 in src/backend/storage/buffer/bufmgr.c
> to DEBUG1 or LOG. I think this will only generate a couple of log
> messages per bgwriter_delay. That should be tolerable, especially for
> testing purposes.
Good ideas; I'll try to get that in place once we can benchmark, though it'll be easier to get pg_buffercache in place, so I'll focus on that first.
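Roughly what I have in mind once pg_buffercache is in place is to see which relations the high-usage_count buffers belong to, along these lines (the join on relfilenode only resolves relations in the current database, and the usage_count >= 4 cutoff is just an arbitrary "hot" threshold):

    -- top relations by buffers cached, and how many of those buffers the
    -- clock sweep would have to decrement several times before eviction
    SELECT c.relname,
           count(*) AS buffers,
           sum(CASE WHEN b.usagecount >= 4 THEN 1 ELSE 0 END) AS high_usage
      FROM pg_buffercache b
      JOIN pg_class c ON b.relfilenode = c.relfilenode
     WHERE b.reldatabase IN (0, (SELECT oid FROM pg_database
                                  WHERE datname = current_database()))
     GROUP BY c.relname
     ORDER BY buffers DESC
     LIMIT 20;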
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net