Re: [HACKERS] Clock with Adaptive Replacement

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Clock with Adaptive Replacement
Date: 2018-04-26 20:51:11
Message-ID: CAH2-WzkKb5Q3dQ2-uVarhPOEP7UdZ8=jM72Db1oPSftFk=fRFg@mail.gmail.com
Lists: pgsql-hackers

On Thu, Apr 26, 2018 at 10:39 AM, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I think it is. We haven't done anything to address it. I think if we
> want to move to direct I/O -- which may be something we need or want
> to do -- we're going to have to work a lot harder at making good page
> eviction decisions.

+1

> Your patch to change the page eviction algorithm
> didn't help noticeably once we eliminated the contention around buffer
> eviction, but that's just because the cost of a bad eviction went
> down, not because we stopped doing bad evictions.

The goal of that patch, which was literally written over a wet
weekend, was to make people realize that we were missing something
there. I think that it accomplished that much.

> I think it would be
> interesting to insert a usleep() call into mdread() and then test
> various buffer eviction algorithms with that in place.

That could be a useful strategy.
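Something as small as this would probably do it (a sketch against
mdread() in md.c; the delay variable is invented for illustration, not
a real GUC):

/* In src/backend/storage/smgr/md.c -- a sketch, not a real patch */
static int	simulated_read_delay_us = 5000;		/* pretend 5ms storage */

void
mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
	   char *buffer)
{
	/*
	 * Pretend every read comes from slow storage, so that the cost of
	 * a bad eviction is no longer hidden by the OS page cache.
	 */
	if (simulated_read_delay_us > 0)
		pg_usleep(simulated_read_delay_us);

	/* ... existing mdread() body: seek to the block and read it ... */
}

With that in place, the quality of eviction decisions shows up
directly as time spent sleeping, which is much easier to measure
reproducibly than real I/O.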

> I'm personally not very excited about making rules like "index pages
> are more valuable than heap pages". Such rules will in many cases be
> true, but it's easy to come up with cases where they don't hold: for
> example, we might run pgbench for a while and then stop running
> pgbench and start running big sequential scans for reporting purposes.

I am not in favor of this general approach. Actually, I'm willing to
come out against it strongly right now: let's not teach the buffer
manager to distinguish blocks by the access method they belong to.

> We don't want to artificially pin the index buffers in shared_buffers
> just because they're index pages; we want to figure out which pages
> really matter. Now, I realize that you weren't proposing (and
> wouldn't propose) a rule that index pages never get evicted, but I
> think that favoring index pages even in some milder way is basically a
> hack. Index pages aren't *intrinsically* more valuable; they are more
> valuable because they will, in many workloads, be accessed more often
> than heap pages. A good algorithm ought to be able to figure that out
> based on the access pattern, without being explicitly given a hint, I
> think.

I won't argue against any of that, but I think that it's rather
nuanced, and that the nuance probably matters a lot.

First of all, we must have a sense of proportion about what is likely
to be true in the average case. I mentioned to Thomas that the pgbench
accounts table is 6x the size of its primary key, and that with a uniform
distribution + point lookups the leaf pages literally have 6x the
utility of the heap pages. It really is that simple there. Also,
whatever the distribution of the point lookups is, "cache more leaf
pages" is probably going to be the effect that a better caching
strategy would have. Even 6x is probably an underestimate of the
utility of leaf pages compared to heap pages in the average case. I
bet it's more like 10x for a typical OLTP app.
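To spell out the 6x arithmetic: with a leaf level of N pages and a
heap of 6N pages, each uniform point lookup touches one leaf page and
one heap page, so

    hits per leaf page     1/N
    ------------------  =  ------  =  6
    hits per heap page     1/(6N)

i.e. a cached leaf page saves six times as many reads as a cached heap
page, before any skew in the lookups is even considered.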

Second of all, we must have a sense of perspective about what we're
trying to do here, which is to predict the future based on the past.
The best outcome we can hope for is to lose less on average without
hurting anything that already works well enough.

> I believe the root of the problem here is that the usage count we have
> today does a very poor job distinguishing what's hot from what's not.

I think that the root of the problem is that we don't remember
anything about evicted buffers. The same block can bounce in and out
of shared_buffers repeatedly, and we don't recognize that. There are
probably second-order effects from sizing shared_buffers. We
needlessly tie the number of buffers to our capacity to remember
things about block popularity. That would be bad if we had no FS
cache, but with the FS cache it seems awful.
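The minimal version of "remember something about evicted buffers" is a
small ghost table keyed by buffer tag, along the lines of ARC's B1/B2
lists. A sketch, where the ghost_table_* calls are entirely
hypothetical:

typedef struct GhostEntry
{
	BufferTag	tag;			/* identity of the evicted block */
	uint32		evicted_at;		/* logical clock value at eviction */
} GhostEntry;

/* At eviction time: remember which block this buffer held (sketch) */
static void
ghost_remember(const BufferTag *tag, uint32 now)
{
	GhostEntry *entry = ghost_table_insert(tag);	/* hypothetical */

	entry->tag = *tag;
	entry->evicted_at = now;
}

/*
 * At fault time: a hit here means the block bounced out and back in,
 * so the caller can give the re-read buffer a higher starting
 * usage_count (or whatever the priority mechanism ends up being).
 */
static bool
ghost_recall(const BufferTag *tag)
{
	return ghost_table_lookup(tag) != NULL;		/* hypothetical */
}

The point is that the metadata is tiny and its capacity can be set
independently of NBuffers, so remembering a block's popularity no
longer costs a whole 8KB buffer.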

I share the intuition that we need something adaptive, that can be
analysed and understood relatively easily, and has little to no magic.
That's why I'm willing to say now that we shouldn't do anything based
on the type of the blocks involved. However, I think that such a
model might work about as well in practice, and that this matters
because it provides motivating examples. Surely you're right when you
say that index blocks are not intrinsically more useful than any other
type of block, but if they're 10x more useful on average, and 20x more
useful at other times, then a person could be forgiven for modelling
them as intrinsically more useful.

It's also useful to examine why a random replacement strategy is
merely mediocre to bad, and not abysmal. Under enough cache pressure,
every replacement strategy degrades into random replacement anyway.
That makes it more reasonable to focus on the average case here than
it is in other places.

> There have been previous experiments around making usage_count use
> some kind of a log scale: we make the maximum, say, 32, and the clock
> hand divides by 2 instead of subtracting 1. I don't think those
> experiments were enormously successful and I suspect that a big part
> of the reason is that it's still pretty easy to get to a state where
> the counters are maxed out for a large number of buffers, and at that
> point you can't distinguish between those buffers any more: they all
> look equally hot. We need something better. If a system like this is
> working properly, things like interior index pages and visibility map
> pages ought to show up as fiery hot on workloads where the index or
> visibility map, as the case may be, is heavily used.
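(For concreteness, the scheme described there amounts to roughly the
following change to the clock sweep -- this is a sketch of the idea,
not the actual code from those experiments:

#define BM_MAX_USAGE_COUNT	32		/* instead of today's 5 */

	/* inside the StrategyGetBuffer() sweep loop */
	if (usage_count > 0)
	{
		usage_count /= 2;		/* geometric decay; was usage_count-- */
		/* not a victim on this pass; advance the clock hand */
	}
	else
	{
		/* usage_count == 0: this buffer is the victim */
	}

and the failure mode is exactly as described: the hot set saturates at
32 and the counters stop discriminating.)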

I don't think that internal index pages and the visibility map are the
problem, since there are so few of them, and they are accessed at least
hundreds of times more frequently than the leaf pages.

--
Peter Geoghegan
