Re: Page replacement algorithm in buffer cache

From: Merlin Moncure <mmoncure(at)gmail(dot)com>
To: Bruce Momjian <bruce(at)momjian(dot)us>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page replacement algorithm in buffer cache
Date: 2013-04-01 20:59:26
Message-ID: CAHyXU0ycV+S33vL7wM+78vWDb=x3X=B-JKpYb66eHzQr5n2tQw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Apr 1, 2013 at 3:32 PM, Bruce Momjian <bruce(at)momjian(dot)us> wrote:
> On Mon, Apr 1, 2013 at 11:55:07AM -0500, Merlin Moncure wrote:
>> > In fact, BufFreelistLock is really misnamed, because for the most
>> > part, the "free list" as we implement is going to be empty. What the
>> > BufFreelistLock is really doing is serializing the process of scanning
>> > for a free buffer. I think THAT is the problem. If we could arrange
>> > things so as to hold BufFreelistLock only for the amount of time
>> > needed to remove a buffer from a freelist ... we'd probably buy
>> > ourselves quite a bit.
>>
>> right. I'm imaging a buffer scan loop that looks something like
>> (uncompiled, untested) this. "TryLockBufHdr" does a simple TAS
>> without spin, returning the lock state (well, true if it acquired the
>> lock). usage_count is specifically and deliberately adjusted without
>> having a lock on the buffer header (this would require some careful
>> testing and possible changes elsewhere):
>
> TAS does a CPU 'lock' instruction which affects the cpu cache. Why not
> just read the value with no lock?

check again, that's exactly what it does. Note the old implementation
did a LockBufHdr() before examining refcount. The key logic is here:

if (buf->refcount == 0)
{
if (buf->usage_count > 0)
{
buf->usage_count--;
trycounter = NBuffers;
}
else
{
if (TryLockBufHdr(buf)
{

So we do an unlocked read of refcount and immediately bail if the
buffer is "locked" according to the cpu cache. Then we check (still
unlocked) usage_count and decrement it: Our adjustment may be lost,
but so what? Finally, we attempt one (and only one) cache line lock
(via TAS_SPIN) of the buffer and again bail if we see any problems
there. Thus, it's impossible to get stuck in a potentially yielding
spin while holding the free list lock.

I dub this: "The Frightened Turtle" strategy of buffer allocation.
It's an idea based on my research trying to solve Vlad's issue of
having server stalls during read-only loads (see here:
http://postgresql.1045698.n5.nabble.com/High-SYS-CPU-need-advise-td5732045.html)
for a general backgrounder. The idea may not actually fix his issue,
or there may be other aggravating aspects such as the
always-capricious linux scheduler, but I'm suspicious.

merlin

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2013-04-01 21:09:20 Re: Page replacement algorithm in buffer cache
Previous Message Magnus Hagander 2013-04-01 20:54:13 Re: "Orphaned" files after initdb