From: Jim Nasby <jim(at)nasby(dot)net>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: our buffer replacement strategy is kind of lame
Date: 2011-08-15 23:26:32
Message-ID: 34851930-E7F3-4EF9-BE14-1B5ABAE375F3@nasby.net
Lists: pgsql-hackers
On Aug 13, 2011, at 3:40 PM, Greg Stark wrote:
> It does kind of seem like your numbers indicate we're missing part of
> the picture though. The idea with the clock sweep algorithm is that
> you keep approximately 1/nth of the buffers with each of the n values.
> If we're allowing nearly all the buffers to reach a reference count of
> 5 then you're right that we've lost any information about which
> buffers have been referenced most recently.
One possible missing piece here is that OS clock sweeps rely on the clock hand to both increment and decrement the usage count. The hardware sets a bit any time a page is accessed; as the clock sweeps, it increments the usage count if the bit is set and decrements it if the bit is clear. I believe someone else in the thread suggested this, and I definitely think it's worth an experiment. Presumably this would also ease some lock contention issues.
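Roughly, what I have in mind is something like this. It's only a simplified sketch: BufferDescLite, the referenced flag, and buffer_accessed()/sweep_one() are made-up names, not the real BufferDesc or the real clock sweep; accesses only set a flag, and the sweep alone adjusts the count.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_USAGE_COUNT 5       /* hypothetical cap, same value as BM_MAX_USAGE_COUNT */

/* Hypothetical, simplified buffer descriptor (not the real BufferDesc). */
typedef struct
{
    bool    referenced;         /* set on every access, cleared by the sweep */
    uint8_t usage_count;        /* adjusted only by the sweep */
    bool    pinned;
} BufferDescLite;

/* On access: just mark the buffer; no usage_count bump here. */
static void
buffer_accessed(BufferDescLite *buf)
{
    buf->referenced = true;
}

/*
 * One sweep step: increment the count if the buffer was touched since the
 * last pass, decrement it otherwise.  Returns true if the buffer is now a
 * candidate for eviction.
 */
static bool
sweep_one(BufferDescLite *buf)
{
    if (buf->referenced)
    {
        buf->referenced = false;
        if (buf->usage_count < MAX_USAGE_COUNT)
            buf->usage_count++;
        return false;
    }
    if (buf->usage_count > 0)
        buf->usage_count--;
    return !buf->pinned && buf->usage_count == 0;
}
```

Since an access only sets a flag, the hot path wouldn't need to bump a counter in the buffer header on every pin, which is where I'd expect the contention relief to come from.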
There is another piece that might be relevant... many (most?) OSes keep multiple lists of pages. FreeBSD, for example, maintains the following page lists (http://www.freebsd.org/doc/en/articles/vm-design/article.html). A full description follows, but I think the biggest take-away is that once a page is no longer active, how it is handled depends on whether the page is dirty or not.
Active: These pages are actively in use and are not currently under consideration for eviction. This is roughly equivalent to all of our buffers with a usage count of 5.
When an active page's usage count drops to its minimum value, it gets unmapped from process space and moved to one of two queues:
Inactive: DIRTY pages that are eligible for eviction once they've been written out.
Cache: CLEAN pages that may be immediately reclaimed
Free: A small set of pages that are basically the tail of the Cache list. The OS *must* maintain some pages on this list to support memory needed during interrupt handling. The size of this list is typically kept very small, and I'm not sure if non-interrupt processing will pull from this list.
It's important to note that the OS can pull a page out of the Inactive or Cache lists back into Active very cheaply.
I think there are two interesting points here. First: after a page has been determined to no longer be in active use, it goes into Inactive or Cache based on whether it's dirty. ISTM that allows for much better scheduling of the flushing of dirty pages. That said, I'm not sure how much that would help us, given our checkpoint requirements.
Second: AFAIK only the Active list has a clock sweep. I believe the others are LRU (the mentioned URL refers to them as queues). I believe this works well because when one of those pages is referenced again, it just needs to be removed from whichever queue it is in, added to the Active queue, and mapped back into process space.
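To tie those two points together, here is a rough sketch of the transitions described above. The queue names come from the FreeBSD article, but the Page struct and the function names are made up for illustration; this is not FreeBSD's (or our) actual code.

```c
#include <stdbool.h>

/* Queue names follow the FreeBSD article; everything else is illustrative. */
typedef enum { Q_ACTIVE, Q_INACTIVE, Q_CACHE, Q_FREE } PageQueue;

typedef struct
{
    PageQueue queue;
    bool      dirty;
    int       usage_count;      /* only tracked while on the Active queue */
} Page;

/*
 * Deactivation: once the Active clock sweep drives usage_count to zero,
 * route the page by dirtiness.  Dirty pages must be written before they
 * can be reclaimed; clean pages are immediately reclaimable.
 */
static void
deactivate(Page *p)
{
    p->queue = p->dirty ? Q_INACTIVE : Q_CACHE;
}

/*
 * Background flushing: the Inactive queue is, in effect, a ready-made list
 * of write-back candidates.  Writing a page out makes it clean, so it can
 * drop to the Cache queue.
 */
static void
flush_inactive(Page *p)
{
    /* ... issue the write here ... */
    p->dirty = false;
    p->queue = Q_CACHE;
}

/*
 * Reactivation: a reference to an Inactive or Cache page just moves it back
 * onto the Active queue with no I/O, which is why pulling a page back in
 * is so cheap.
 */
static void
reactivate(Page *p)
{
    p->queue = Q_ACTIVE;
    p->usage_count = 1;
}
```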
--
Jim C. Nasby, Database Architect jim(at)nasby(dot)net
512.569.9461 (cell) http://jim.nasby.net