From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Greg Stark <stark(at)mit(dot)edu>
Cc: Peter Geoghegan <pg(at)heroku(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Clock sweep not caching enough B-Tree leaf pages?
Date: 2014-04-17 14:18:43
Message-ID: CA+Tgmob-ym_qK7XitsqHHOJ6JEtDKS+f2B54Hp7RKrROJ+L0WA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Apr 17, 2014 at 9:40 AM, Greg Stark <stark(at)mit(dot)edu> wrote:
> On Wed, Apr 16, 2014 at 12:44 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
>> This isn't a fundamental property of the usage-count idea; it's an
>> artifact of the fact that usage count decreases are tied to eviction
>> pressure rather than access pressure. For example, suppose we made a
>> rule that if the total usage counts of all buffers exceed 3 *
>> NBuffers, then every time you bump the usage count of a buffer from N
>> to N+1, you're required to advance the clock sweep far enough to
>> decrease the reference count of a buffer by one.
>
> This sounds like the right way to reason about it.
>
> From what I remember in school the idea with the clock sweep is to set
> the usage flags to the maximum whenever the buffer is used and
> decrement (actually iirc typically shift right) it when the clock
> sweep goes by. I.e., simulate an LRU where when the buffer is accessed it
> jumps to the head of the list and when the clock comes by it moves
> gradually down the list.
>
> What you're pointing out is that the clock might not come by very
> often, resulting in everything being at the head of the list. In that case
> I'm not clear it really matters what gets evicted though. And the cpu
> effort of running the clock n times sounds bad but doing the work
> earlier doesn't really change the amount of work being done, it just
> amortizes it over more calls.
>
> But if you want to do that it seems to me the way to do it is every
> time a buffer is pinned, set it to the maximum and then run the clock
> max_value - previous_value times. So the total of the usage counts of all buffers
> remains constant. If that results in contention one way to reduce it
> is to do this probabilistically. Run the clock 1% of the time but run
> it 100x as much as you would normally.
>
> But I think you've misidentified the problem and what those other
> algorithms are trying to solve. The problem is not that Postgres will
> pick a bad buffer to evict. If all the buffers have been used since the
> last time the clock came around then they're all "hot" anyways and it
> doesn't really matter which one we evict. The problem is that we
> expend an inordinate amount of work finding the few non-hot buffers.
> When you have a really large amount of memory and 99.9% of it is hot
> but 0.1% is whatever random non-hot page was needed last then there's
> an obvious buffer to evict when you need a new one. But we spend a lot
> of work decrementing every hot buffer's usage count 4 times only to
> have them immediately incremented again just to find the 1 buffer
> where the usage count was 4 or 3. The goal of these algorithms that
> divide the buffers into groups is to avoid having to do so much work
> to find the colder buffers. Once the hot buffers migrate to the hot
> pool we only need to run the clock there when we find we have new hot
> pages that we want to promote. All the thrashing in the cold pool can
> be more efficient because there's many fewer pages to consider.

Well, I think Peter has proved that PostgreSQL *will* pick a bad
buffer to evict. The proof is that when he changed the choice of
buffer to evict, he got a significant performance improvement.

I also believe this to be the case based on first principles and on my
own experiments. Suppose you have a workload that fits inside
shared_buffers. All of the usage counts will converge to 5. Then,
somebody accesses a table that is not cached, so something's got to be
evicted. Because all the usage counts are the same, the eviction at
this point is completely indiscriminate. We're just as likely to kick
out a btree root page or a visibility map page as we are to kick out a
random heap page, even though the former have probably been accessed
several orders of magnitude more often. That's clearly bad. On
systems that are not too heavily loaded it doesn't matter too much
because we just fault the page right back in from the OS pagecache.
But I've done pgbench runs where such decisions lead to long stalls,
because the page has to be brought back in from disk, and there's a
long I/O queue; or maybe just because the kernel thinks PostgreSQL is
issuing too many I/O requests and makes some of them wait to cool
things down.
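
To make that concrete, here is a toy standalone C program (not
PostgreSQL code; the buffer count and names are made up for
illustration) that models the sweep once every usage count has
already saturated at 5:

#include <stdio.h>

#define NBUFFERS  8             /* toy stand-in for NBuffers */
#define MAX_USAGE 5             /* BM_MAX_USAGE_COUNT */

int
main(void)
{
    int     usage[NBUFFERS];
    int     hand = 0;
    int     decrements = 0;
    int     i;

    /* The workload fit in shared_buffers, so every count converged to 5. */
    for (i = 0; i < NBUFFERS; i++)
        usage[i] = MAX_USAGE;

    /* Clock sweep: decrement and advance until some count reaches zero. */
    while (usage[hand] > 0)
    {
        usage[hand]--;
        decrements++;
        hand = (hand + 1) % NBUFFERS;
    }

    printf("evicting buffer %d after %d decrements (%d full sweeps)\n",
           hand, decrements, decrements / NBUFFERS);
    return 0;
}

The victim is just whichever buffer the hand happens to stop on, which
has nothing to do with how hot the page actually was, and we paid
5 * NBuffers decrements to find it.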

Of course, the overhead of repeated clock sweeps to push down the
usage counts isn't a great thing either. I'm not saying that isn't a
problem. But I think bad decisions about what to evict are also a
problem.
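
To be a bit more concrete about the rule quoted at the top of this
message, here is a rough toy sketch of the idea, not a patch; the
names (bump_usage, decay_one_buffer, total_usage) are invented for the
illustration. Once the sum of the usage counts exceeds 3 * NBuffers,
every bump from N to N+1 pays for itself by advancing the sweep until
some buffer's count has been decremented, so the total stays bounded
and the counts keep separating hot pages from cold ones.

#include <stdio.h>

#define NBUFFERS  8
#define MAX_USAGE 5

static int  usage[NBUFFERS];
static int  total_usage = 0;
static int  sweep_hand = 0;

/* Advance the sweep until exactly one usage count has been decremented. */
static void
decay_one_buffer(void)
{
    for (;;)
    {
        int     victim = sweep_hand;

        sweep_hand = (sweep_hand + 1) % NBUFFERS;
        if (usage[victim] > 0)
        {
            usage[victim]--;
            total_usage--;
            return;
        }
    }
}

/* Called on every buffer access in this toy model. */
static void
bump_usage(int buf)
{
    if (usage[buf] >= MAX_USAGE)
        return;                 /* already saturated, no bump to pay for */

    usage[buf]++;
    total_usage++;

    /* Decay is driven by access pressure, not eviction pressure. */
    if (total_usage > 3 * NBUFFERS)
        decay_one_buffer();
}

int
main(void)
{
    int     round, buf;

    /* Hammer buffers 0-5; touch buffers 6 and 7 once each at the end. */
    for (round = 0; round < 1000; round++)
        for (buf = 0; buf < 6; buf++)
            bump_usage(buf);
    bump_usage(6);
    bump_usage(7);

    for (buf = 0; buf < NBUFFERS; buf++)
        printf("buffer %d: usage count %d\n", buf, usage[buf]);
    return 0;
}

With that rule in place the constantly-touched buffers stay near the
cap while the pages that were touched only once stay near zero,
instead of everything sitting at 5 and the eviction decision being
made blind.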

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
