Re: Page replacement algorithm in buffer cache

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Ants Aasma <ants(at)cybertec(dot)at>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page replacement algorithm in buffer cache
Date: 2013-03-23 19:34:23
Message-ID: CAHyXU0xsN4p1H4CqQMCR0zDnXNNX=xKFaxKSJ19XUsnF1ys84Q@mail.gmail.com
Lists: pgsql-hackers

On Fri, Mar 22, 2013 at 7:27 PM, Ants Aasma <ants(at)cybertec(dot)at> wrote:
> On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> well, if you do a non-locking test first, you could at least avoid
>> some cases (and, if you get the answer wrong, so what?) by jumping to
>> the next buffer immediately. if the non-locking test comes back good,
>> only then do you do a hardware TAS.
>>
>> you could in fact go further and dispense with all locking around
>> usage_count, on the premise that it's only advisory and not a real
>> refcount. so you only lock if/when it's time to select a candidate
>> buffer, and even then only after a non-locking test first. this would
>> of course require some amusing adjustments to various logical checks
>> (usage_count <= 0, heh).
>
> Moreover, if the buffer happens to miss a decrement due to a data
> race, there's a good chance that the buffer is heavily used and
> wouldn't need to be evicted soon anyway. (if you arrange it to be a
> read-test-inc/dec-store operation then you will never go out of
> bounds)

yeah. There's something to be said for having an upper bound on the
length of time it takes to get a page out (except in the special case
when most of them are pinned). Right now, contention on any buffer
header for any reason can shut down buffer allocation, and that's just
not good. It's obviously not very likely to happen, but I think it can
and does happen. The more I think about it, the more I think it's a
bad idea to spin during buffer allocation for any reason, ever.
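
To make that concrete, here's roughly the shape I have in mind, as a
standalone sketch in C11 atomics (SketchBufferDesc, SweepTouchBuffer,
and the field layout are all invented for illustration; this is not
how bufmgr.c organizes things):

#include <stdatomic.h>
#include <stdbool.h>

typedef struct SketchBufferDesc
{
    atomic_int  usage_count;    /* advisory; may miss updates under races */
    /* ... other fields elided ... */
} SketchBufferDesc;

/*
 * Touch a buffer during the clock sweep without taking the buffer
 * header spinlock.  The read-test-store pattern keeps usage_count
 * from ever going below zero even when backends race; a lost
 * decrement just makes the buffer look a little hotter than it is,
 * which is harmless for an advisory counter.  Returns true if the
 * buffer looks like an eviction candidate -- only then would we pay
 * for a locked recheck.
 */
static bool
SweepTouchBuffer(SketchBufferDesc *buf)
{
    int         count = atomic_load_explicit(&buf->usage_count,
                                             memory_order_relaxed);

    if (count <= 0)
        return true;            /* candidate: do the locked check now */

    /* best-effort decrement; if the CAS loses a race, just move on */
    atomic_compare_exchange_weak_explicit(&buf->usage_count,
                                          &count, count - 1,
                                          memory_order_relaxed,
                                          memory_order_relaxed);
    return false;
}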

> However, clocksweep and usage_count maintenance is not what is
> causing contention because that workload is distributed. The issue is
> pinning and unpinning. There we need an accurate count and there are
> some pages like index roots that get hit very heavily. The things to
> do there, in my opinion, are to convert to a futex-based spinlock, so
> that contention doesn't completely kill performance, and then to try
> to get rid of the contention. Converting to lock-free pinning
> won't help much here as what is killing us here is the cacheline
> bouncing.

Yup -- futexes are another way to go. They are Linux-specific, though.
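
For reference, the usual spin-then-sleep futex lock looks roughly like
this (a minimal standalone sketch of the standard pattern; Linux-only,
using raw syscall() since glibc doesn't wrap futex, and not anything
from the PostgreSQL tree):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdatomic.h>

/* lock states: 0 = unlocked, 1 = locked, 2 = locked with waiters */
typedef atomic_int futex_lock;

static void
futex_lock_acquire(futex_lock *l)
{
    int         c = 0;

    /* fast path: uncontended 0 -> 1, one atomic op, no kernel entry */
    if (atomic_compare_exchange_strong(l, &c, 1))
        return;

    /* slow path: advertise waiters (state 2) and sleep in the kernel
     * instead of burning CPU while the lock holder is descheduled */
    if (c != 2)
        c = atomic_exchange(l, 2);
    while (c != 0)
    {
        syscall(SYS_futex, (int *) l, FUTEX_WAIT, 2, NULL, NULL, 0);
        c = atomic_exchange(l, 2);
    }
}

static void
futex_lock_release(futex_lock *l)
{
    if (atomic_exchange(l, 0) == 2)     /* waiters present: wake one */
        syscall(SYS_futex, (int *) l, FUTEX_WAKE, 1, NULL, NULL, 0);
}

Under no contention that costs the same as a plain spinlock; the
difference is entirely in what happens when the fast path fails.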

> One way to get rid of contention is the buffer nailing idea that
> Robert came up with. If some buffer gets so hot that maintaining
> refcount on the buffer header leads to contention, promote that buffer
> to a nailed status, let everyone keep their pin counts locally and
> sometime later revisit the nailing decision and if necessary convert
> pins back to the buffer header.

Yeah, this is a more general (albeit more complicated) solution and
would likely be fantastic. Is it safe to assume that refcounting is
the only likely cause of contention?
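
If I'm reading the nailing idea right, the pin fast path ends up
looking something like this (purely illustrative: the names and the
flat per-backend array are mine, and the promote/demote transition is
glossed over entirely):

#include <stdatomic.h>

#define NBUFFERS_SKETCH 1024

typedef struct BufferStateSketch
{
    atomic_int  refcount;       /* shared pin count; the hot cacheline */
    atomic_bool nailed;         /* promoted: pins are tracked locally */
} BufferStateSketch;

static BufferStateSketch shared_bufs[NBUFFERS_SKETCH];     /* in shmem */
static int  local_pin_count[NBUFFERS_SKETCH];              /* per backend */

static void
PinBufferSketch(int buf_id)
{
    if (atomic_load_explicit(&shared_bufs[buf_id].nailed,
                             memory_order_acquire))
    {
        /* hot buffer: bump a backend-private counter and touch no
         * shared cacheline at all */
        local_pin_count[buf_id]++;
    }
    else
    {
        /* ordinary buffer: shared atomic increment, as today */
        atomic_fetch_add(&shared_bufs[buf_id].refcount, 1);
    }
}

Un-nailing would be the expensive step, since every backend's local
counts have to be folded back into the shared refcount before that
count means anything again.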

> One other interesting idea I have seen is closeable scalable nonzero
> indication (C-SNZI) from scalable rw-locks [1]. The idea there is to
> use a tree structure to dynamically stripe access to the shared lock
> counter when contention is detected. The downside is that a
> considerable amount of shared memory is needed, so there needs to be
> some way to
> limit the resource usage. This is actually somewhat isomorphic to the
> nailing idea.
>
> The issue with the current buffer management algorithm is that it
> seems to scale badly with increasing shared_buffers. I think the
> improvements should concentrate on finding out what the problem is
> there and figuring out how to fix it. A simple idea to test would be
> to just partition shared buffers along with the whole clock sweep
> machinery into smaller ones, like the buffer mapping hash tables
> already are. This should at the very least reduce contention for the
> clock sweep even if it doesn't reduce work done per page miss.
>
> [1] http://people.csail.mit.edu/mareko/spaa09-scalablerwlocks.pdf

I'll have to take a look. Removing *all spinning* from page
allocation, though, feels like it might be worthwhile to test (got to
give some bonus points for it being a very local change that's simple
to implement). I wonder if, with more shared buffers, you tend to
sweep more buffers per allocation. (IIRC Jeff J was skeptical of
that.)
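
For the partitioning experiment, I'd picture the clock hand ending up
shaped something like this (invented names, ignoring the freelist and
pinned-buffer handling; just the outline):

#include <stdatomic.h>

#define NUM_SWEEP_PARTITIONS 16

typedef struct SweepPartition
{
    int         first_buf;      /* first buffer id in this partition */
    int         num_bufs;       /* buffers owned by this partition */
    atomic_uint next_victim;    /* clock hand, offset within partition */
} SweepPartition;

static SweepPartition sweep_parts[NUM_SWEEP_PARTITIONS];

/*
 * Each backend sweeps "its own" partition first, so concurrent
 * allocations mostly advance different clock hands instead of all
 * fighting over one shared one.
 */
static int
ClockSweepTickSketch(int my_backend_id)
{
    SweepPartition *p = &sweep_parts[my_backend_id % NUM_SWEEP_PARTITIONS];
    unsigned    offset = atomic_fetch_add(&p->next_victim, 1)
                         % (unsigned) p->num_bufs;

    return p->first_buf + (int) offset;     /* candidate to examine */
}

That at least splits the hot atomic NUM_SWEEP_PARTITIONS ways; whether
it also reduces buffers swept per allocation is exactly the question
the test would answer.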

merlin
