Re: [HACKERS] Clock with Adaptive Replacement

From: Юрий Соколов <funny(dot)falcon(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Stephen Frost <sfrost(at)snowman(dot)net>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Konstantin Knizhnik <k(dot)knizhnik(at)postgrespro(dot)ru>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: [HACKERS] Clock with Adaptive Replacement
Date: 2018-04-29 19:39:20
Message-ID: CAL-rCA3=wu6Lzx5uf9m2BP24Venmnc5TG2Vs5FZ4U=VKqKgLUw@mail.gmail.com
Lists: pgsql-hackers

On Sun, Apr 29, 2018 at 2:17 AM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:

> On Sat, Apr 28, 2018 at 7:39 AM, Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> > Perhaps I'm misunderstanding the case here, but with the ring buffer we
> > use for sequential access, aren't we already discouraging keeping heap
> > pages around, particularly when they're coming from a sequential scan?
> >
> > I'm not suggesting that's a bad thing either; in fact, it seems like a
> > generally good thing, and I don't think we get many complaints about it.
> > I'm just trying to point out that we already have a distinction between
> > heap-pages-from-seq-scans and index pages (and heap pages from using
> > those indexes), and therefore I'm not convinced by an argument that we
> > shouldn't ever treat pages from the heap differently than pages from the
> > indexes.
>
> The argument is not a slam dunk, to be sure. Since you bring it up, I
> happen to think that the ring buffer/strategy scan thing is too
> aggressive in a lot of cases. Why not cache more when the existing
> contents of shared_buffers are of marginal importance? Concern about a
> kind of "tyranny of the average case" seems fairly well founded to me.
> I am most concerned about improving the average case in particular,
> but also very concerned about not regressing atypical or bursty cases.
>
> > This could change when considering an environment where we aren't trying
> > to take advantage of the kernel's buffer cache and things like its
> > read-ahead, and maybe that's what you're getting at here, but I would
> > also point out that we have more inherent understanding of how pages are
> > likely to be accessed than the kernel does and we shouldn't be
> > explicitly trying to avoid taking advantage of that knowledge.
>
> I'm definitely in favor of taking advantage of that knowledge. I just
> think that we can derive the choice of which buffer to evict from
> high-level principles that are relatively easy to understand in isolation.
> This may matter later, when production system behavior needs to be
> analyzed. I might agree to the idea of artificially favoring one type
> of block over another if I thought that there was no better way, but I
> think that there is. I also think that a generalized approach isn't
> that much more difficult, particularly when you factor in the need to
> convince other people.
>
> > I agree that we should definitely be thinking about data density here --
> > and that's another case where we have more information about the data
> > than the kernel does (which I compare it to, as it also tries to predict
> > value but has less information).
>
> Even my 10x figure is, in a way, very conservative. B-Tree index pages
> have another obvious though potentially overlooked advantage, which is
> that they naturally have spatial or logical locality. The only kind of
> locality that heap pages can ever have is temporal locality. I
> hesitate to say too much more about what must be true on average,
> across the universe of Postgres installations, but I can safely say
> this much: it's totally pedestrian and normal for *leaf* pages to have
> more than 100x the utility of heap pages in some applications, since a
> leaf page is not just "denser"; it may also have significant spatial
> or logical locality, where heap pages have none at all.
>
> There could very easily be an enormous range of usefulness among
> buffers in shared_buffers. And yet, we only recognize the increments
> that usage_count can represent (this is before we even think about the
> actual problems with how usage_count is maintained). We're never going
> to be able to give *radically* more useful leaf pages a concomitant
> advantage during buffer eviction if we have to use some rule of thumb.
>
> > Another consideration here
> > to think about would be internal pages rather than just leaf pages-
> > those have an even higher likelihood of being needed again quickly as
> > they need to be traversed much more than leaf pages.
>
> As I said, I'm not too worried about that case, because there are so
> few internal pages -- we're probably talking about a small fraction of
> 1% of the total. It seems unlikely that there'd be so much cache
> pressure that the cost of reading in an internal block is much worse
> than the cost of reading in any other block. Main memory sizes are big
> enough that we should be able to think about things at the level of
> minutes (maybe we can't right now, but we *should* be able to).
>
> (This probably doesn't matter much when it comes to finding a way
> forward, so we can agree to disagree here without consequence.)
>
> > This is definitely an interesting and valuable point. Having pages
> > bouncing in and out of shared buffers means that we aren't properly
> > tracking their value. Just brainstorming, but I wonder if there's a way
> > we could track their value even when we've evicted them, without slowing
> > down the entire system. I don't have any great ideas off-hand, but some
> > kind of "we last accessed this block an hour ago" and "we last accessed
> > this block half a millisecond ago" might be interesting, to try to avoid
> > such ping-ponging. I wonder if there are known strategies for
> > addressing this, perhaps by having a data structure which effectively
> > tracks hit rates for all potential blocks...
>
> Well, ARC has its ghost lists (so does CAR, as Thomas pointed out just
> now). ARC/CAR is adaptive, in that it drifts towards favoring either
> recency or frequency for the system as a whole. Provided we can make
> that scalable, it will probably work well for us. The bi-modal nature
> of pages in Postgres seems to make ARC or CAR a good fit. Also, I
> suspect that we can afford to use a somewhat less scalable approach,
> since it doesn't matter how fast buffer eviction is if it's just
> spinning its wheels, and because in recent years we've mitigated some
> of the bottlenecks.
>
> In a way, it's not at all surprising that we can end up with
> usage_counts of 5 all around when shared_buffers is large. Blocks end
> up "resting on their laurels"; they're impossible to evict based on a
> large amount of blind luck, such as being involved in a burst of
> activity some time ago (so much for clocksweep approximating an LRU;
> this is actually more like an LFU). ARC/CAR moves a "recent" block to
> the "frequent" list if the block gets referenced a second time after
> an interval that is, in effect, self-correcting (it may or may not be
> from the ghost recent list or the real buffer cache recent list --
> either way it goes to the head of the frequent list). Moving a block
> from the recent to the frequent list also enlarges the recent list
> target size at the expense of the frequent list target size. "This
> block in particular deserves to be a frequent block, but as a
> consequence all blocks should collectively be considered slightly less
> deserving of being frequent blocks."
>
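
For reference, the adaptation rule described here is given in the ARC
paper (Megiddo and Modha) roughly as follows. This is only a sketch in
the paper's notation (c, p, B1, B2), not PostgreSQL code:

/*
 * ARC's adaptation of p, the target size of the "recent" list T1.
 * B1/B2 are the ghost lists behind the recent/frequent lists.
 */
static int c;                   /* cache capacity, in pages */
static int p;                   /* adaptive target size of T1 */
static int b1_len;              /* current length of ghost list B1 */
static int b2_len;              /* current length of ghost list B2 */

/* Hit in ghost recent list B1: recency was undervalued, so the
 * recent-list target grows at the frequent list's expense. */
static void
on_b1_ghost_hit(void)
{
    int delta = (b1_len >= b2_len) ? 1 : b2_len / b1_len;

    p = (p + delta < c) ? p + delta : c;
}

/* Hit in ghost frequent list B2: frequency was undervalued, so the
 * recent-list target shrinks. */
static void
on_b2_ghost_hit(void)
{
    int delta = (b2_len >= b1_len) ? 1 : b1_len / b2_len;

    p = (p > delta) ? p - delta : 0;
}

In the paper it is the ghost hits that move the target; a hit in the
real recent list promotes the block to the head of the frequent list
without changing p.
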
> Importantly, this allows blocks to have their extra utility recognized
> by getting moved to the frequent list, but specifically tracks whether
> or not the system fails to appreciate recent blocks that deserve to be
> promoted to frequent blocks, but cannot get a break because they get
> evicted from the recent list too quickly (including the ghost recent
> list). Because the recent ghost list and frequent ghost list are fixed
> in size (only the real lists/head lists get resized), we won't forget
> too much about recency or frequency. And, we won't go overboard with
> "writing off frequent list blocks as has-been dead weight", because we
> notice that we've gone too far, and measure that in terms of would-be
> cache hits that ended up as actual cache misses. Most important of
> all, we won't "get stuck" as we see with clocksweep today, while still
> having something that's stable for stable workloads.
>
> > Every replacement strategy which doesn't care about the kind of block
> > that's being worked with will end up performing random replacement under
> > enough pressure. We don't actually have to allow that to happen though
> > since we do have information about the kind of block and therefore it
> > seems like we should be able to keep things mediocre, and avoid letting
> > things get bad under such pressure.
>
> If you have extreme cache pressure, I don't think that there is
> anything left to worry about other than the scalability of evicting
> buffers. That should be a rare enough occurrence for well maintained
> systems these days (again, this may only actually be true as an
> aspiration). Just look at the ARC paper's statistics; they report that
> they're not much better than the competition (including even a classic
> LRU) when there is either mild cache pressure or extreme cache
> pressure. The middle ground is the interesting part.
>
> All of that having been said, maybe we have an independent low-level
> problem: we increase usage_count when we pin a buffer, even if we last
> pinned the buffer 5ms ago. IIRC a change to this went in when ARC went
> in (and came out with ARC, too). This is more of a problem with
> miscounting the number of hits based on accidents around how things
> are structured in the executor -- we need to be more careful about
> recognizing whether or not block hits are logically distinct.
> Particularly with stuff like nested loop joins.
>

I've developed an algorithm as an improvement on GClock:
- the usage count is incremented not on every access, but only on the
first access within a given span of clock-arm movement:
-- a new item is inserted with count 0, and the count is not incremented
until the clock arm has moved at least 1/32 of a full round,
-- once the arm has moved that far, the usage count is incremented on the
next access and the arm position is remembered; during the next 1/32 of a
round the usage count is again not incremented,
- the usage count is incremented not by 1 but by 8, which simulates
movement to the "frequent" list of ARC/CAR/2Q,
-- the usage count is capped at 32, so there are effectively four
simulated frequency queues,
- to find an item for eviction, items are scanned in batches of 25:
-- if an item's usage count is zero, it is evicted and the scan stops,
-- otherwise the usage count is decremented, as in GClock,
-- the item with the least usage count seen so far is remembered,
-- if the batch ends without a zero-count item, the item with the least
usage count is evicted.
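
In rough C, the whole scheme looks like this (a sketch only: all names
and constants are illustrative, and locking, pinning, and dirty buffers
are ignored entirely):

#include <stddef.h>

#define NBUFFERS   1024                  /* illustrative pool size */
#define QUANTUM    (NBUFFERS / 32)       /* 1/32 of a full clock round */
#define USAGE_STEP 8                     /* increment per promotion */
#define USAGE_MAX  32                    /* cap: four simulated queues */
#define BATCH_SIZE 25                    /* eviction scan batch size */

typedef struct
{
    int           usage;                 /* usage count, 0 .. USAGE_MAX */
    unsigned long last_bump;             /* arm position at last increment */
} Buf;

static Buf           bufs[NBUFFERS];
static unsigned long arm;                /* total arm movement, ever growing */

/* A newly loaded buffer starts cold. */
static void
on_insert(Buf *b)
{
    b->usage = 0;
    b->last_bump = arm;
}

/* On access, bump the count only if the arm has moved at least
 * 1/32 of a round since this buffer's count was last bumped. */
static void
on_access(Buf *b)
{
    if (arm - b->last_bump >= QUANTUM)
    {
        b->usage += USAGE_STEP;
        if (b->usage > USAGE_MAX)
            b->usage = USAGE_MAX;
        b->last_bump = arm;
    }
}

/* Scan a batch of BATCH_SIZE buffers to pick an eviction victim. */
static Buf *
find_victim(void)
{
    Buf *min_buf = NULL;
    int  i;

    for (i = 0; i < BATCH_SIZE; i++)
    {
        Buf *b = &bufs[arm++ % NBUFFERS];

        if (b->usage == 0)
            return b;                    /* zero count: evict immediately */
        b->usage--;                      /* decrement, as in GClock */
        if (min_buf == NULL || b->usage < min_buf->usage)
            min_buf = b;                 /* remember least count so far */
    }
    return min_buf;                      /* batch over: evict the minimum */
}

The batch bounds the cost of an eviction scan, while the 1/32-round rule
prevents a burst of accesses from inflating a buffer's count more than
once per quantum.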

I've tested this algorithm on some artificial traces. It is much better
than LRU, but usually worse than ARC and CLOCK+ (by the way, why has no
one mentioned CLOCK+ in this thread?).

I haven't implemented it in PostgreSQL yet (I did implement batch
scanning once, but the other details of the algorithm had not been
developed at that point).

And I don't know whether it will really be better for PostgreSQL.
And I believe CLOCK+ would perform better if implemented correctly, but
modified a bit to use an array instead of a list (CLOCK+ originally uses
a list with the ability to insert and remove at any position).

So you see: there are a lot of algorithms, and new ones can be suggested
as we go. It is not easy to choose one on which to concentrate our
attention.

That is why I propose collecting some real-world traces and testing the
algorithms against them before rushing into an implementation.

With regards,
Sokolov Yura.
