From: Jim Nasby <jim(at)nasby(dot)net>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Alvaro Herrera <alvherre(at)commandprompt(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: BufFreelistLock
Date: 2010-12-14 21:42:06
Message-ID: DC555169-6758-4996-B51C-E9B3845385BC@nasby.net
Lists: pgsql-hackers
On Dec 14, 2010, at 11:08 AM, Jeff Janes wrote:
> On Sun, Dec 12, 2010 at 6:48 PM, Jim Nasby <jim(at)nasby(dot)net> wrote:
>>
>> BTW, when we moved from 96G to 192G servers I tried increasing shared buffers from 8G to 28G and performance went down enough to be noticeable (we don't have any good benchmarks, so I can't really quantify the degradation). Going back to 8G brought performance back up, so it seems like it was the change in shared buffers that caused the issue (the larger servers also have 24 cores vs 16).
>
> What kind of work load do you have (intensity of reading versus
> writing)? How intensely concurrent is the access?
It writes at the rate of ~3-5MB/s, doing ~700TPS on average. It's hard to judge the exact read mix, because it's running on a 192G server (actually, 512G now, but 192G when I tested). The working set is definitely between 96G and 192G; we saw a major performance improvement last year when we went to 192G, but we haven't seen any improvement moving to 512G.
We typically have 10-20 active queries at any point.
>> My immediate thought was that we needed more lock partitions, but I haven't had the chance to see if that helps. ISTM the issue could just as well be due to clock sweep suddenly taking over 3x longer than before.
>
> It would surprise me if most clock sweeps need to make anything near a
> full pass over the buffers for each allocation (but technically it
> wouldn't need to do that to take 3x longer. It could be that the
> fraction of a pass it needs to make is merely proportional to
> shared_buffers. That too would surprise me, though). You could
> compare the number of passes with the number of allocations to see how
> much sweeping is done per allocation. However, I don't think the
> number of passes is reported anywhere, unless you compile with
> #define BGW_DEBUG and run with debug2.
>
> I wouldn't expect an increase in shared_buffers to make contention on
> BufFreelistLock worse. If the increased buffers are used to hold
> heavily-accessed data, then you will find the pages you want in
> shared_buffers more often, and so need to run the clock-sweep less
> often. That should make up for longer sweeps. But if the increased
> buffers are used to hold data that is just read once and thrown away,
> then the clock sweep shouldn't need to sweep very far before finding a
> candidate.
Well, we're talking about a working set that's between 96 and 192G, but only 8G (or 28G) of shared buffers. So there's going to be a pretty large amount of buffer replacement happening. We also have 210 tables where the ratio of heap buffer hits to heap reads is over 1000, so the stuff that is in shared buffers probably keeps usage_count quite high. Put these two together, and we're probably spending a fairly significant amount of time running the clock sweep.
Even excluding our admittedly unusual workload, there is still significant overhead in running the clock sweep versus just grabbing something off of the free list (assuming we had separate locks for the two operations). Does anyone know what the overhead of getting a block from the filesystem cache is? I wonder how many buffers you can sweep through in the same amount of time. Put another way, at some point the sweep has to examine so many buffers to find a free one that you've effectively doubled the time it takes to get data from the filesystem cache into a shared buffer.
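To make the comparison concrete, here's a minimal sketch of the two paths I mean. This is not the actual code in src/backend/storage/buffer/freelist.c -- the structures, constants, and function names below are all made up for illustration -- but it shows why a hot cache (high usage_count everywhere) makes the sweep path much more expensive than the free-list path:

/*
 * Simplified sketch (not real PostgreSQL source) of the two allocation
 * paths: an O(1) pop from the free list versus the clock sweep, which
 * has to visit -- and decrement -- every high-usage_count buffer it
 * passes before it finds a victim.
 */
#include <stdbool.h>

#define NBUFFERS  1024          /* stand-in for shared_buffers */

typedef struct BufferDesc
{
    int  usage_count;           /* bumped on each hit, decayed by the sweep */
    bool pinned;                /* currently in use by some backend */
    int  free_next;             /* next free buffer, or -1 if not on the list */
} BufferDesc;

static BufferDesc buffers[NBUFFERS];
static int first_free = -1;     /* head of the free list */
static int next_victim = 0;     /* the clock hand */

/* Cheap path: constant-time pop, if anything is on the free list. */
static int
get_from_freelist(void)
{
    int buf = first_free;

    if (buf >= 0)
        first_free = buffers[buf].free_next;
    return buf;
}

/* Expensive path: sweep until an unpinned buffer with usage_count == 0 turns up. */
static int
run_clock_sweep(void)
{
    int tries = NBUFFERS * 5;   /* usage_count is capped, so this terminates */

    while (tries-- > 0)
    {
        BufferDesc *buf = &buffers[next_victim];

        next_victim = (next_victim + 1) % NBUFFERS;
        if (buf->pinned)
            continue;
        if (buf->usage_count > 0)
        {
            buf->usage_count--; /* decay and keep sweeping */
            continue;
        }
        return (int) (buf - buffers);
    }
    return -1;                  /* everything pinned */
}

/* In the real code both of these run under the single BufFreelistLock. */
int
get_victim_buffer(void)
{
    int buf = get_from_freelist();

    if (buf < 0)
        buf = run_clock_sweep();
    return buf;
}

The point is that the free-list path touches one buffer header no matter how big shared_buffers is, while the sweep path touches a number of headers that grows with both shared_buffers and how hot the cache is.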
> But of course being able to test would be better than speculation.
Yeah, I'm working on getting pg_buffercache installed so we can see what's actually in the cache.
Hmm... I wonder how hard it would be to hack something up that has a separate process that does nothing but run the clock sweep. We'd obviously not run a hack in production, but we're working on being able to reproduce a production workload. If we had a separate clock-sweep process we could get an idea of exactly how much work was involved in keeping free buffers available.
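Just to pin down what I mean by a separate process that does nothing but run the clock sweep: something shaped roughly like the toy below, where a dedicated sweeper keeps the free list stocked to a target depth and backends only ever do the O(1) pop. None of this is proposed patch code -- the names, the pthread model, and the constants are all made up purely to show the shape of the idea (a real version would also have to skip pinned buffers, avoid re-adding buffers already on the list, and not hold the lock for the whole refill):

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define NBUFFERS        1024
#define FREELIST_TARGET   64    /* how many free buffers to keep on hand */

static int usage_count[NBUFFERS];
static int freelist[NBUFFERS];
static int freelist_len = 0;
static int next_victim = 0;
static pthread_mutex_t freelist_lock = PTHREAD_MUTEX_INITIALIZER;

/* What a backend would do: just pop, never sweep. */
static int
get_free_buffer(void)
{
    int buf = -1;

    pthread_mutex_lock(&freelist_lock);
    if (freelist_len > 0)
        buf = freelist[--freelist_len];
    pthread_mutex_unlock(&freelist_lock);
    return buf;                 /* -1 means the sweeper fell behind */
}

/* The dedicated sweeper: run the clock sweep and top up the free list. */
static void *
sweeper_main(void *arg)
{
    (void) arg;
    for (;;)
    {
        pthread_mutex_lock(&freelist_lock);
        while (freelist_len < FREELIST_TARGET)
        {
            int victim = next_victim;

            next_victim = (next_victim + 1) % NBUFFERS;
            if (usage_count[victim] > 0)
                usage_count[victim]--;          /* decay, keep sweeping */
            else
                freelist[freelist_len++] = victim;
        }
        pthread_mutex_unlock(&freelist_lock);
        usleep(1000);           /* nap until backends drain the list again */
    }
    return NULL;
}

int
main(void)
{
    pthread_t sweeper;

    pthread_create(&sweeper, NULL, sweeper_main, NULL);
    sleep(1);                   /* let the sweeper stock the list */
    printf("got buffer %d without sweeping\n", get_free_buffer());
    return 0;
}

With something like that, all the victim-finding work lives in one place, so it would be easy to measure exactly how much of it our workload generates.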
BTW, given our workload I can't see any way of running at debug2 without having a large impact on performance.
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net