From: | Simon Riggs <simon(at)2ndquadrant(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: our buffer replacement strategy is kind of lame |
Date: | 2011-08-12 10:53:36 |
Message-ID: | CA+U5nMKzRtapt1-aOzdWqArin3PRZ-Q5TcHFej66GC7EtN98_g@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Fri, Aug 12, 2011 at 5:05 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> The general problem here is that we are not very smart about handling
> workloads with weak locality - i.e. the working set is larger than
> shared buffers. If the working set fits in shared_buffers, we will
> keep it there, and it will be strongly protected against eviction.
> But if the working set *doesn't* fit in shared buffers, then we fall
> back on some not-very-smart heuristics to decide what to do: keep
> pages involved in index scans, ditch those involved in sequential
> scans.
>
> It seems to me that we need a smarter algorithm. It seems that NetBSD
> has adopted the use of Clock-Pro, and there are some discussions of it
> being used in Linux as well, though I'm not sure whether that's
> actually (still?) the case. Paper is here, although I find the
> description of the algorithm to be approximately clear as mud:
>
> http://www.cse.ohio-state.edu/~fchen/paper/papers/usenix05.pdf
>
> One of the important notions which many of the algorithms in the
> literature seem to hit on in one way or another is that you mustn't go
> and decide that all of your buffers - or nearly all your buffers - are
> hot. That's really equivalent to saying that none of your buffers are
> hot; and it's also pretty easy to see that such a scenario would be
> disastrous for our current algorithm, which - if everything in
> shared_buffers has usage-count 5 all the time - will have to clock
> sweep a long way each time it needs to allocate a buffer. In fact,
> the more of your buffers want to be hot (and thus hard to evict), the
> fewer of them should actually be allowed to acquire that status.
This is something I've been actively working on. I was on the trail of
possible reasons that increasing the size of shared buffers would slow
down PostgreSQL.
The worst case behaviour of the current freelist code is that it can
take up to 5 * shared_buffers checks before identifying a victim
buffer. That occurs when we have a working set exactly matching size
of shared buffers. There are problems when large portions of shared
buffers are very popular, so that the free-ish buffers are a small
proportion of the total buffers - this case gets worse if the
distribution of buffer allocations is non-random so you get say 10
free-ish blocks together, followed by N-10 very popular ones. That
makes every 11th freelist request take ~O(shared_buffers). These
problems will show themselves as sporadic BufFreelistLock contention.
Sporadic is the problem here since it may make the contention hard to
measure in aggregate, yet with real impact on performance.
For us to benefit from increased shared_buffers we need to have an
algorithm that is O(k) in its worst case behaviour and O(1) in its
best case behaviour - the latter aspect is provided by a correctly
functioning bgwriter. At the moment, we do nothing to enforce O(k)
behaviour.
The following patch implements O(k) behaviour using a heuristic limit.
My first attempt was a fixed heuristic limit, but that was less
suitable because there are actually 2 requirements. The value of k
must be small to have a measurable impact on the smoothness of
StrategyGetBuffer(), but k must also be fairly large to ensure the
choice of victim is a relatively good one. So the way to balance those
conflicting objectives is to have a progressively increasing search
window. An just to repeat, k is not a % of shared buffers, which would
still give O(N) behaviour.
I've not been reading "the literature", given the problems we had in
2004/5 regarding patents in this area. I also think that since we rely
on the underlying filesystem for cacheing that we don't have exactly
the same problem as other systems.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment | Content-Type | Size |
---|---|---|
freelist_ok.v2.patch | application/octet-stream | 2.2 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Alexander Korotkov | 2011-08-12 10:59:55 | Re: WIP: Fast GiST index build |
Previous Message | Shigeru Hanada | 2011-08-12 10:24:03 | Change format of FDW options used in \d* commands (was: Re: per-column FDW options, v5) |