Re: index prefetching

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2024-11-07 17:55:14
Message-ID: CAH2-Wzn7vqmt=qE_hDrOx4NETkUoCbdn74G1gswMXi1APUuYrA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 7, 2024 at 10:03 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> The primary reason why I kept amgettuple() as is, and added a new AM
> callback for the "batch" mode is backwards compatibility. I did not want
> to force all AMs to do this, I think it should be optional. Not only to
> limit the disruption for out-of-core AMs, but also because I'm not 100%
> sure every AM will be able to do batching in a reasonable way.

All index AMs that implement amgettuple are fairly similar to nbtree. They are:

* nbtree itself
* GiST
* Hash
* SP-GiST

They all have the same general notion of page-at-a-time processing,
with buffering of items for the amgettuple callback to return. There
are perhaps enough differences to be annoying in SP-GiST, and with
GiST's ordered scans (which use a pairing heap rather than true
page-at-a-time processing). I guess you're right that you'll need to
maintain amgettuple support for the foreseeable future, to support
these special cases.
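To illustrate what I mean by "the same general notion": the per-page
state that each of these AMs buffers looks roughly like this (field
names invented for illustration; nbtree's real equivalents are
BTScanPosData/BTScanPosItem in access/nbtree.h):

    typedef struct AMScanPosItem
    {
        ItemPointerData heapTid;    /* table TID to return to executor */
        OffsetNumber indexOffset;   /* index tuple's offset on the page */
    } AMScanPosItem;

    typedef struct AMScanPos
    {
        Buffer      buf;          /* pin on the page items[] came from */
        int         firstItem;    /* first valid entry in items[] */
        int         lastItem;     /* last valid entry in items[] */
        int         itemIndex;    /* next entry for amgettuple to return */
        AMScanPosItem items[MaxIndexTuplesPerPage];
    } AMScanPos;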

I still think that you shouldn't need to use amgettuple in either
nbtree or hash, since neither AM does anything non-generic in this
area. It should be normal to never need to use amgettuple.

> Yes, we could ditch the batching in indexam.c, and just rely on the AM
> batching, just like now.

To be clear, I had imagined completely extracting the batching from
the index AM, since it isn't really at all coupled to individual index
AM implementation details anyway. I don't hate the idea of doing more
in the index AM, but whether or not it happens there vs. somewhere
else isn't my main concern at this point.

My main concern right now is that one single place be made to see
every relevant piece of information about costs and benefits. Probably
something inside indexam.c.

> There are a couple of details why the separate
> batching seemed convenient:
>
> 1) We may need to stash some custom data for each TID (e.g. so that IOS
> does not need to check VM repeatedly). But perhaps that could be
> delegated to the index AM too ...
>
> 2) We need to maintain two "positions" in the index. One for the item
> the executor is currently processing (and which might end up getting
> marked as "killed" etc). And another one for "read" position, i.e. items
> passed to the read stream API / prefetching, etc.

That all makes sense.

> 3) It makes it clear when the items are no longer needed, and the AM can
> do cleanup, process kill tuples, etc.

But it doesn't, really. The index AM is still subject to exactly the
same constraints in terms of page-at-a-time processing. These existing
constraints always came from the table AM side, so it's not as if your
patch can remain totally neutral on these questions.

Basically, it looks like you've invented a shadow batching interface
that is technically not known to the index AM, but nevertheless
coordinates with the existing so->currPos batching interface.

> I don't think indexam.c knows all that much about the nbtree internal
> batching. It "just" relies on amgetbatch() producing items the AM can
> handle later (during killtuples/cleanup etc.). It does not even need to
> be a single-leaf-page batch, if the AM knows how to track/deal with that
> internally.

I'm concerned that no single place will know about everything under
this scheme. Having one single place that has visibility into all
relevant costs, whether they're index AM or table AM related, is what
I think you should be aiming for.

I think that you should be removing the parts of the nbtree (and other
index AM) code that deal with the progress of the scan explicitly.
What remains is code that simply reads the next page, and saves its
details in the relevant data structures. Or code that "finishes off" a
leaf page by dropping its pin, and maybe doing the _bt_killitems
stuff.

The index AM itself should no longer know about the current next tuple
to return, nor about mark/restore. It is no longer directly in control
of the scan's progress. It loses all context that survives across API
calls.
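A sketch of what might remain, reusing the AMScanPos idea from above
(callback names invented here, just to show the shape I have in mind,
not a worked-out proposal):

    /*
     * Read the next index page in the given direction, filling 'pos'
     * with its matching TIDs.  Returns false when the scan is done.
     */
    typedef bool (*amreadpage_function) (IndexScanDesc scan,
                                         ScanDirection dir,
                                         AMScanPos *pos);

    /*
     * Finish off a previously read page: do _bt_killitems-style
     * processing for entries the caller marked dead, then drop the pin.
     */
    typedef void (*amfinishpage_function) (IndexScanDesc scan,
                                           AMScanPos *pos);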

> Yes, splitting _bt_steppage() like this makes sense to me, and I agree
> being able to proceed to the next page before we're done with the
> current page seems perfectly reasonable for batches spanning multiple
> leaf pages.

I think that it's entirely possible that it'll just be easier to do
things this way from the start. I understand that that may be far from
obvious right now, but, again, I just don't see what's so special
about the way that each index AM batches results. What about it is
so hard to generalize across the index AMs that must support
amgettuple right now? (At least in the case of nbtree and hash, which
have no special requirements, unlike KNN-GiST.)

Most individual calls to btgettuple just return the next batched-up
so->currPos tuple/TID via another call to _bt_next. Things like the
_bt_first-new-primitive-scan case don't really add any complexity --
the core concept of processing a page at a time still applies. It
really is just a simple batching scheme, with a couple of extra fiddly
details attached to it -- but nothing too hairy.
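Abridged (and glossing over the details of scans with array keys),
btgettuple amounts to roughly this:

    bool
    btgettuple(IndexScanDesc scan, ScanDirection dir)
    {
        BTScanOpaque so = (BTScanOpaque) scan->opaque;
        bool        res;

        /* btree indexes are never lossy */
        scan->xs_recheck = false;

        /* each loop iteration performs another primitive index scan */
        do
        {
            if (!BTScanPosIsValid(so->currPos))
                res = _bt_first(scan, dir);   /* descend to a leaf page */
            else
            {
                /* ...kill_prior_tuple bookkeeping elided... */
                res = _bt_next(scan, dir);    /* next so->currPos item */
            }

            if (res)
                break;
            /* otherwise, see if another primitive index scan is needed */
        } while (so->numArrayKeys && _bt_start_prim_scan(scan, dir));

        return res;
    }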

The hardest part will probably be rigorously describing the rules for
not breaking index-only scans due to concurrent TID recycling by
VACUUM, and the rules for doing _bt_killitems. But that's also not a
huge problem, in the grand scheme of things.

> Hmm. I've intentionally tried to ignore these issues, or rather to limit
> the scope of the patch so that v1 does not require dealing with it.
> Hence the restriction to single-leaf batches, for example.
>
> But I guess I may have to look at this after all ... not great.

To be clear, I don't think that you necessarily have to apply these
capabilities in v1 of this project. I would be satisfied if the patch
could just break things out in the right way, so that some later patch
could improve things later on. I only really want to see the
capabilities within the index AM decomposed, such that one central
place can see a global view of the costs and benefits of the index
scan.

You should be able to validate the new API by stress-testing the code.
You can make the index AM read several leaf pages at a time when a
certain debug mode is enabled. Once you prove that the index AM
performs the same processing as today correctly, without any
needless restrictions on the ordering in which these decomposed
operations are performed (only required restrictions that are well
explained and formalized), then things should be on the right path.

> > ISTM that the central feature of the new API should be the ability to
> > reorder certain kinds of work. There will have to be certain
> > constraints, of course. Sometimes these will principally be problems
> > for the table AM (e.g., we mustn't allow concurrent TID recycling
> > unless it's for a plain index scan using an MVCC snapshot), other
> > times they're principally problems for the index AM (e.g., the
> > _bt_killitems safety issues).
> >
>
> Not sure. By "new API" you mean the read stream API, or the index AM API
> to allow batching?

Right now those two concepts seem incredibly blurred to me.

> > I get that you're not that excited about multi-page batches; it's not
> > the priority. Fair enough. I just think that the API needs to work in
> > terms of batches that are sized as one or more pages, in order for it
> > to make sense.
> >
>
> True, but isn't that already the case? I mean, what exactly prevents an
> index AM from "building" a batch for multiple leaf pages? The current patch
> does not implement that for any of the AMs, true, but isn't that already
> possible if the AM chooses to?

That's unclear, but overall I'd say no.

The index AM API says that index AMs need to hold on to a buffer pin to
avoid confusing scans due to concurrent TID recycling by VACUUM. The
index AM API fails to adequately describe what is expected here. And
it provides no useful context for larger batching of index pages.
nbtree already does its own thing by dropping leaf page pins
selectively.

Whether or not it's technically possible is a matter of interpretation
(I came down on the "no" side, but it's still ambiguous). I would
prefer it if the index AM API was much simpler for ordered scans. As I
said already, something along the lines of "when you're told to scan
the next index page, here's how we'll call you, here's the data
structure that you need to fill up". Or "when we tell you that we're
done fetching tuples from a recently read index page, here's how we'll
call you".

These discussions about where the exact boundaries lie don't seem very
helpful. The simple fact is that nobody is ever going to invent an
index AM side interface that batches up more than a single leaf page.
Why would they? It just doesn't make sense to, since the index AM has
no idea about certain clearly-relevant context. For example, it has no
idea whether or not there's a LIMIT involved.

The value that comes from using larger batches on the index AM side
comes from making life easier for heap prefetching, which index AMs
know nothing about whatsoever. Again, the goal should be to marry
information from the index AM and the table AM in one central place.

> Unfortunately that didn't work because of killtuples etc.: the index
> AM had no idea about the indexam queue and has its own concept of
> "current item", so it was confused about which item to mark as killed.
> And that old item might even be from an earlier leaf page (not the
> "current" currPos).

Currently, during a call to btgettuple, so->currPos.itemIndex is
updated within _bt_next. But before _bt_next is called,
so->currPos.itemIndex indicates the item returned by the most recent
prior call to btgettuple -- which is also the tuple that the
scan->kill_prior_tuple reports on. In short, btgettuple does some
trivial things to remember which entries from so->currPos ought to be
marked dead later on due to the scan->kill_prior_tuple flag having
been set for those entries. This can be moved outside of each index
AM.
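The trivial things in question amount to roughly the following
(lightly abridged from btgettuple):

    if (scan->kill_prior_tuple)
    {
        /*
         * Remember the prior tuple, so that _bt_killitems can mark it
         * dead once we're finally done with its so->currPos page.
         */
        if (so->killedItems == NULL)
            so->killedItems = (int *)
                palloc(MaxTIDsPerBTreePage * sizeof(int));
        if (so->numKilled < MaxTIDsPerBTreePage)
            so->killedItems[so->numKilled++] = so->currPos.itemIndex;
    }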

The index AM shouldn't need to use a scan->kill_prior_tuple style flag
under the new batching API at all, though. It should work at a higher
level than that. The index AM should be called through a callback that
tells it to drop the pin on a page that the table AM has been reading
from, and maybe perform _bt_killitems on these relevant known-dead
TIDs first. In short, all of the bookkeeping for so->killedItems[]
should be happening at a completely different layer. And the
so->killedItems[] structure should be directly associated with a
single index page subset of a batch (a subset similar to the current
so->currPos batches).

The first time the index AM sees anything about dead TIDs, it should
see a whole leaf page worth of them.
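Concretely, the bookkeeping might look like this on the indexam.c side
(hypothetical, building on the earlier sketches; imagine that
AMScanPos grew ndead/deadItems[] fields):

    /*
     * Remember a dead entry for the page subset that the tuple most
     * recently returned to the executor came from.  The index AM only
     * hears about these entries when amfinishpage is eventually called,
     * a whole page's worth at a time.
     */
    static void
    index_batch_kill_prior_tuple(AMScanPos *pos)
    {
        if (pos->ndead < MaxIndexTuplesPerPage)
            pos->deadItems[pos->ndead++] = pos->itemIndex;
    }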

> I was thinking maybe the AM could keep the leaf pages, and then free
> them once they're no longer needed. But it wasn't clear to me how to
> exchange this information between indexam.c and the index AM, because
> right now the AM only knows about a single (current) position.

I'm imagining a world in which the index AM doesn't even know about
the current position. Basically, it has no real context about the
progress of the scan to maintain at all. It merely does what it is
told by some higher level that is sensitive to the requirements of
both the index AM and the table AM.

> But imagine we have this:
>
> a) A way to switch the scan into "batch" mode, where the AM keeps the
> leaf page (and a way for the AM to indicate it supports this).

I don't think that there needs to be a batch mode. There could simply
be the total absence of batching, which is one point along a
continuum, rather than a discrete mode.

> b) Some way to track two "positions" in the scan - one for read, one for
> prefetch. I'm not sure if this would be internal in each index AM, or at
> the indexam.c level.

I think that it would be at the indexam.c level.
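Roughly like this, say (invented names once more):

    typedef struct BatchPos
    {
        int         pageIndex;  /* which page subset of the batch */
        int         itemIndex;  /* offset into that subset's items[] */
    } BatchPos;

    typedef struct IndexScanBatchState
    {
        AMScanPos  *pages;      /* page subsets filled in by the AM */
        int         npages;
        BatchPos    readPos;    /* tuple most recently returned to the
                                 * executor; killed-tuple and mark/restore
                                 * bookkeeping reference this */
        BatchPos    streamPos;  /* next TID to hand to the read stream
                                 * for prefetching; at or ahead of
                                 * readPos */
    } IndexScanBatchState;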

> c) A way to get the index tuple for either of the two positions (and
> advance the position). It might be a flag for amgettuple(), or maybe
> even a callback for the "prefetch" position.

Why does the index AM need to know anything about the fact that the
next tuple has been requested? Why can't it just be 100% ignorant of
all that? (Perhaps barring a few special cases, such as KNN-GiST
scans, which continue to use the legacy amgettuple interface.)

> d) A way to inform the AM items up to some position are no longer
> needed, and thus the leaf pages can be cleaned up and freed. AFAICS it
> could always be "up to the current read position".

Yeah, I like this idea. But the index AM doesn't need to know about
positions and whatnot. It just needs to do what it's told: to drop the
pin, and maybe to perform _bt_killitems first. Or maybe just to drop
the pin, with instruction to do _bt_killitems coming some time later
(the index AM will need to be a bit more careful within its
_bt_killitems step when this happens).

The index AM doesn't need to drop the current pin for the current
position -- not as such. The index AM doesn't directly know about what
pins are held, since that'll all be tracked elsewhere. Again, the
index AM should need to hold onto zero context, beyond the immediate
request to perform one additional unit of work, which will
usually/always happen at the index page level (all of which is tracked
by data structures that are under the control of the new indexam.c
level).

I don't think that it'll ultimately be all that hard to schedule when
and how index pages are read from outside of the index AM in question.
In general all relevant index AMs already work in much the same way
here. Maybe we can ultimately invent a way for the index AM to
influence that scheduling, but that might never be required.

> Does that sound reasonable / better than the current approach, or have I
> finally reached the "raving lunatic" stage?

The stage after "raving lunatic" is enlightenment. :-)

--
Peter Geoghegan
