Re: Page replacement algorithm in buffer cache

From: Jim Nasby <jim(at)nasby(dot)net>
To: Merlin Moncure <mmoncure(at)gmail(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Ants Aasma <ants(at)cybertec(dot)at>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Greg Stark <stark(at)mit(dot)edu>, Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Page replacement algorithm in buffer cache
Date: 2013-04-01 23:09:07
Message-ID: 515A1393.9090908@nasby.net
Lists: pgsql-hackers

On 4/1/13 4:55 PM, Merlin Moncure wrote:
> On Mon, Apr 1, 2013 at 4:09 PM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>> On 2013-04-01 08:28:13 -0500, Merlin Moncure wrote:
>>> On Sun, Mar 31, 2013 at 1:27 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
>>>> On Friday, March 22, 2013, Ants Aasma wrote:
>>>>>
>>>>> On Fri, Mar 22, 2013 at 10:22 PM, Merlin Moncure <mmoncure(at)gmail(dot)com>
>>>>> wrote:
>>>>>> well if you do a non-locking test first you could at least avoid some
>>>>>> cases (and, if you get the answer wrong, so what?) by jumping to the
>>>>>> next buffer immediately. if the non-locking test comes good, only
>>>>>> then do you do a hardware TAS.
>>>>>>
>>>>>> you could in fact go further and dispense with all locking in front of
>>>>>> usage_count, on the premise that it's only advisory and not a real
>>>>>> refcount. so you only then lock if/when it's time to select a
>>>>>> candidate buffer, and only then when you did a non-locking test first.
>>>>>> this would of course require some amusing adjustments to various
>>>>>> logical checks (usage_count <= 0, heh).
>>>>>
>>>>> Moreover, if the buffer happens to miss a decrement due to a data
>>>>> race, there's a good chance that the buffer is heavily used and
>>>>> wouldn't need to be evicted soon anyway. (If you arrange it to be a
>>>>> read-test-inc/dec-store operation then you will never go out of
>>>>> bounds.) However, clocksweep and usage_count maintenance is not what is
>>>>> causing contention, because that workload is distributed. The issue is
>>>>> pinning and unpinning.
>>>>
>>>>
>>>> That is one of multiple issues. Contention on the BufFreelistLock is
>>>> another one. I agree that usage_count maintenance is unlikely to become a
>>>> bottleneck unless one or both of those is fixed first (and maybe not even
>>>> then).
>>>
>>> usage_count manipulation is not a bottleneck, but that is irrelevant.
>>> It can be affected by other page contention, which can lead to priority
>>> inversion. I don't believe there is any reasonable argument that
>>> sitting and spinning while holding the BufFreelistLock is a good idea.
>>
>> In my experience the mere fact of (unlockedly, but still) accessing all the
>> buffer headers can cause noticeable slowdowns in write-only/mostly workloads with
>> big amounts of shmem.
>> Due to the write-only nature, large numbers of the buffers have similar
>> usagecounts (since they are infrequently touched after the initial insertion)
>> and there are no free ones around, so the search for a buffer frequently runs
>> through *all* buffer headers multiple times till it has decremented all usagecounts
>> to 0. Then comes a period where free buffers are found easily (since all
>> usagecounts from the current sweep point onwards are zero). After that it
>> starts all over.
>> I have now seen that scenario multiple times :(
> Interesting -- I was thinking about that too, but it's a separate
> problem with a different trigger. Maybe a bailout should be in there
> so that after X usage_count adjustments the sweeper summarily does an
> eviction, or maybe the "max" declines from 5 once per hundred buffers
> inspected or some such.
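
If I'm reading that right, the bailout would look very roughly like the
toy sketch below. Everything here is invented for illustration (names,
the budget constant, and the complete absence of locking) -- it's just
to pin down what we're talking about, not actual code:

/* Toy model of the bailout idea -- invented names, no locking,
 * not PostgreSQL code. */
#include <stdint.h>

#define NBUFFERS                16384
#define SWEEP_DECREMENT_BUDGET  100     /* made-up tuning knob */

typedef struct BufDesc
{
    uint32_t refcount;      /* pin count; nonzero means in use */
    uint32_t usage_count;   /* advisory clock-sweep counter */
} BufDesc;

static BufDesc buffers[NBUFFERS];
static int     clock_hand;

/*
 * Sweep as usual, but once we have decremented SWEEP_DECREMENT_BUDGET
 * usage counts without finding a victim, take the next unpinned buffer
 * regardless of its count instead of walking the whole array again.
 */
static BufDesc *
clock_sweep_with_bailout(void)
{
    int decrements = 0;

    for (;;)
    {
        BufDesc *buf = &buffers[clock_hand];

        clock_hand = (clock_hand + 1) % NBUFFERS;

        if (buf->refcount != 0)
            continue;               /* pinned buffers are never victims */

        if (buf->usage_count == 0 || decrements >= SWEEP_DECREMENT_BUDGET)
            return buf;             /* normal victim, or bailout eviction */

        buf->usage_count--;
        decrements++;
    }
}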

What's the potential downside of that, though? I.e., what happens if this scheme suddenly evicts the root page of a heavily used index? You'd then have a ton of stuff blocked waiting for that page to come back in.

This is a use case that I think would benefit greatly from a background process that keeps pages in the free list.
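
Very roughly, I'm picturing something like the following, reusing the toy
BufDesc and clock_sweep_with_bailout() from the sketch above. The extern
helpers and the watermark are stand-ins I made up, not existing PostgreSQL
functions:

/* Hypothetical "bgfreelist" worker -- a sketch, not a design. */
#include <stdbool.h>

extern int  freelist_length(void);          /* buffers currently on the list */
extern void push_freelist(BufDesc *buf);
extern bool buffer_is_dirty(BufDesc *buf);
extern void flush_buffer(BufDesc *buf);
extern void sleep_until_freelist_low(void);

#define FREELIST_HIGH_WATER 256             /* made-up tuning knob */

static void
bgfreelist_main(void)
{
    for (;;)
    {
        /* Refill the shared free list ahead of demand ... */
        while (freelist_length() < FREELIST_HIGH_WATER)
        {
            BufDesc *victim = clock_sweep_with_bailout();

            if (buffer_is_dirty(victim))
                flush_buffer(victim);       /* or hand off to the bgwriter */
            push_freelist(victim);
        }

        /* ... then sleep until backends have drained it back down. */
        sleep_until_freelist_low();
    }
}

Whether the worker should write dirty victims itself or leave them to the
bgwriter is an open question in that sketch.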

That said, I now suspect that your "frightened turtle" approach would be of higher value than "bgfreelist"... but I expect we'll ultimately want both of them, for different reasons.
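
For completeness, the unlocked usage_count test you described upthread
might look something like this. Again just a sketch: the struct, the
atomic_flag "spinlock", and try_victim() are all invented stand-ins for
the real buffer header and its header lock.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct LockedBufDesc
{
    atomic_flag      hdr_lock;      /* stand-in for the buffer-header spinlock */
    uint32_t         refcount;      /* only touched with hdr_lock held */
    _Atomic uint32_t usage_count;   /* advisory; peeked at without the lock */
} LockedBufDesc;

/*
 * Returns true with hdr_lock still held if buf looks evictable; the
 * caller evicts and then releases the lock.  A stale unlocked read just
 * means we skip a buffer (or lock one we then reject) -- harmless,
 * since usage_count is only advisory.
 */
static bool
try_victim(LockedBufDesc *buf)
{
    uint32_t count = atomic_load_explicit(&buf->usage_count,
                                          memory_order_relaxed);

    if (count > 0)
    {
        /* Unlocked, advisory decrement: read-test-store, so it can lose a
         * race but never wraps below zero, per Ants' point upthread. */
        atomic_store_explicit(&buf->usage_count, count - 1,
                              memory_order_relaxed);
        return false;               /* skip the hardware TAS entirely */
    }

    /* Looks evictable -- now pay for the TAS and re-check under the lock. */
    while (atomic_flag_test_and_set(&buf->hdr_lock))
        ;                           /* spin; a real lock would back off */

    if (buf->refcount == 0 &&
        atomic_load_explicit(&buf->usage_count, memory_order_relaxed) == 0)
        return true;

    atomic_flag_clear(&buf->hdr_lock);
    return false;
}
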
--
Jim C. Nasby, Data Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net
