Re: index prefetching

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2024-02-14 21:45:57
Message-ID: CAAKRu_ZKp1BCT+V324jENxKTfsetxJwxh309rJGWxebSggPisw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Feb 14, 2024 at 1:21 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Wed, Feb 14, 2024 at 8:34 AM Tomas Vondra
> <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> > > Another thing that argues against doing this is that we might not need
> > > to visit any more B-Tree leaf pages when there is a LIMIT n involved.
> > > We could end up scanning a whole extra leaf page (including all of its
> > > tuples) for want of the ability to "push down" a LIMIT to the index AM
> > > (that's not what happens right now, but it isn't really needed at all
> > > right now).
> > >
> >
> > I'm not quite sure I understand what is "this" that you argue against.
> > Are you saying we should not separate the two scans? If yes, is there a
> > better way to do this?
>
> What I'm concerned about is the difficulty and complexity of any
> design that requires revising "63.4. Index Locking Considerations",
> since that's pretty subtle stuff. In particular, if prefetching
> "de-synchronizes" (to use your term) the index leaf page level scan
> and the heap page scan, then we'll probably have to totally revise the
> basic API.

So, a pin on the index leaf page is sufficient to keep line pointers
from being reused? If we stick to prefetching heap blocks referred to
by index tuples in a single index leaf page, and we keep that page
pinned, will we still have a problem?

> > The LIMIT problem is not very clear to me either. Yes, if we get close
> > to the end of the leaf page, we may need to visit the next leaf page.
> > But that's kinda the whole point of prefetching - reading stuff ahead,
> > and reading too far ahead is an inherent risk. Isn't that a problem we
> > have even without LIMIT? The prefetch distance ramp up is meant to limit
> > the impact.
>
> Right now, the index AM doesn't know anything about LIMIT at all. That
> doesn't matter, since the index AM can only read/scan one full leaf
> page before returning control back to the executor proper. The
> executor proper can just shut down the whole index scan upon finding
> that we've already returned N tuples for a LIMIT N.
>
> We don't do prefetching right now, but we also don't risk reading a
> leaf page that'll just never be needed. Those two things are in
> tension, but I don't think that that's quite the same thing as the
> usual standard prefetching tension/problem. Here there is uncertainty
> about whether what we're prefetching will *ever* be required -- not
> uncertainty about when exactly it'll be required. (Perhaps this
> distinction doesn't mean much to you. I'm just telling you how I think
> about it, in case it helps move the discussion forward.)

I don't think that the LIMIT problem is too different for index scans
than heap scans. We will need some advice from planner to come down to
prevent over-eager prefetching in all cases.

> Another factor that complicates things here is mark/restore
> processing. The design for that has the idea of processing one page at
> a time baked-in. Kinda like with the kill_prior_tuple issue.

Yes, I mentioned this in my earlier email. I think we can resolve
mark/restore by resetting the prefetch and TID queues and restoring
the last used heap TID in the index scan descriptor.

> It's certainly possible that you could figure out various workarounds
> for each of these issues (plus the kill_prior_tuple issue) with a
> prefetching design that "de-synchronizes" the index access and the
> heap access. But it might well be better to extend the existing design
> in a way that just avoids all these problems in the first place. Maybe
> "de-synchronization" really can pay for itself (because the benefits
> will outweigh these costs), but if you go that way then I'd really
> prefer it that way.

Forcing each index access to be synchronous and interleaved with each
table access seems like an unprincipled design constraint. While it is
true that we rely on that in our current implementation (when using
non-MVCC snapshots), it doesn't seem like a principle inherent to
accessing indexes and tables.

> > > I think that it makes sense to put the index AM in control here --
> > > that almost follows from what I said about the index AM API. The index
> > > AM already needs to be in control, in about the same way, to deal with
> > > kill_prior_tuple (plus it helps with the LIMIT issue I described).
> > >
> >
> > In control how? What would be the control flow - what part would be
> > managed by the index AM?
>
> ISTM that prefetching for an index scan is about the index scan
> itself, first and foremost. The heap accesses are usually the dominant
> cost, of course, but sometimes the index leaf page accesses really do
> make up a significant fraction of the overall cost of the index scan.
> Especially with an expensive index qual. So if you just assume that
> the TIDs returned by the index scan are the only thing that matters,
> you might have a model that's basically correct on average, but is
> occasionally very wrong. That's one reason for "putting the index AM
> in control".

I don't think the fact that it would also be valuable to do index
prefetching is a reason not to do prefetching of heap pages. And,
while it is true that were you to add index interior or leaf page
prefetching, it would impact the heap prefetching, at the end of the
day, the table AM needs some TID or TID-equivalents that whose blocks
it can go fetch. The index AM has to produce something that the table
AM will consume. So, if we add prefetching of heap pages and get the
table AM input right, it shouldn't require a full redesign to add
index page prefetching later.

You could argue that my suggestion to have the index AM manage and
populate a queue of TIDs for use by the table AM puts the index AM in
control. I do think having so many members of the IndexScanDescriptor
which imply a one-at-a-time (xs_heaptid, xs_itup, etc) synchronous
interplay between fetching an index tuple and fetching a heap tuple is
confusing and error prone.

> As I said back in June, we should probably be marrying information
> from the index scan with information from the heap. This is something
> that is arguably a modularity violation. But it might just be that you
> really do need to take information from both places to consistently
> make the right trade-off.

Agreed that we are going to need to mix information from both places.

> If you take a look at _bt_killitems(), you'll see that it actually has
> two fairly different strategies for avoiding TID recycling race
> condition issues, applied in each of two different cases:
>
> 1. Cases where we really have held onto a buffer pin, per the index AM
> API -- the "inde AM orthodox" approach. (The aforementioned issue
> with unlogged indexes exists because with an unlogged index we must
> use approach 1, per the nbtree README section [1]).
>
> 2. Cases where we drop the pin as an optimization (also per [1]), and
> now have to detect the possibility of concurrent modifications by
> VACUUM (that could have led to concurrent TID recycling). We
> conservatively do nothing (don't mark any index tuples LP_DEAD),
> unless the LSN is exactly the same as it was back when the page was
> scanned/read by _bt_readpage().

Re 2: so the LSN could have been changed by some other process (i.e.
not vacuum), so how often in practice is the LSN actually the same as
when the page was scanned/read? Do you think we would catch a
meaningful number of kill prior tuple opportunities if we used an LSN
tracking method like this? Something that let us drop the pin on the
page would obviously be better.

- Melanie

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Melanie Plageman 2024-02-14 22:06:35 Re: Add trim_trailing_whitespace to editorconfig file
Previous Message Andres Freund 2024-02-14 21:37:11 Re: Refactoring backend fork+exec code