Re: index prefetching

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2024-11-07 21:34:54
Message-ID: b8ab7410-0b96-495f-88e6-6b41db39d290@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 11/7/24 18:55, Peter Geoghegan wrote:
> On Thu, Nov 7, 2024 at 10:03 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> The primary reason why I kept amgettuple() as is, and added a new AM
>> callback for the "batch" mode is backwards compatibility. I did not want
>> to force all AMs to do this, I think it should be optional. Not only to
>> limit the disruption for out-of-core AMs, but also because I'm not 100%
>> sure every AM will be able to do batching in a reasonable way.
>
> All index AMs that implement amgettuple are fairly similar to nbtree. They are:
>
> * nbtree itself
> * GiST
> * Hash
> * SP-GiST
>
> They all have the same general notion of page-at-a-time processing,
> with buffering of items for the amgettuple callback to return. There
> are perhaps enough differences to be annoying in SP-GiST, and with
> GiST's ordered scans (which use a pairing heap rather than true
> page-at-a-time processing). I guess you're right that you'll need to
> maintain amgettuple support for the foreseeable future, to support
> these special cases.
>
> I still think that you shouldn't need to use amgettuple in either
> nbtree or hash, since neither AM does anything non-generic in this
> area. It should be normal to never need to use amgettuple.
>

Right, I can imagine not using amgettuple() in nbtree/hash. I guess we
could even remove it altogether, although I'm not sure that'd work right
now (haven't tried).

>> Yes, we could ditch the batching in indexam.c, and just rely on the AM
>> batching, just like now.
>
> To be clear, I had imagined completely extracting the batching from
> the index AM, since it isn't really at all coupled to individual index
> AM implementation details anyway. I don't hate the idea of doing more
> in the index AM, but whether or not it happens there vs. somewhere
> else isn't my main concern at this point.
>
> My main concern right now is that one single place be made to see
> every relevant piece of information about costs and benefits. Probably
> something inside indexam.c.
>

Not sure I understand, but I think I'm somewhat confused by "index AM"
vs. indexam. Are you suggesting the individual index AMs should know as
little about the batching as possible, and instead it should be up to
indexam.c to orchestrate most of the stuff?

If yes, then I agree in principle, and I think indexam.c is the right
place to do that (or at least I can't think of a better one).

That's what the current patch aimed to do, more or less. I'm not saying
it got it perfectly right, and I'm sure there is stuff that can be
improved (like reworking _steppage to not deal with killed tuples). But
surely the index AMs need to have some knowledge about batching, because
how else would it know which leaf pages to still keep, etc?

>> There are a couple details why the separate
>> batching seemed convenient:
>>
>> 1) We may need to stash some custom data for each TID (e.g. so that IOS
>> does not need to check VM repeatedly). But perhaps that could be
>> delegated to the index AM too ...
>>
>> 2) We need to maintain two "positions" in the index. One for the item
>> the executor is currently processing (and which might end up getting
>> marked as "killed" etc). And another one for "read" position, i.e. items
>> passed to the read stream API / prefetching, etc.
>
> That all makes sense.
>

OK

>> 3) It makes it clear when the items are no longer needed, and the AM can
>> do cleanup. process kill tuples, etc.
>
> But it doesn't, really. The index AM is still subject to exactly the
> same constraints in terms of page-at-a-time processing. These existing
> constraints always came from the table AM side, so it's not as if your
> patch can remain totally neutral on these questions.
>

Not sure I understand. Which part of my sentence you disagree with? Or
what constraints you mean?

The interface does not require page-at-a-time processing - the index AM
is perfectly within it's rights to produce a batch spanning 10 leaf
pages, as long as it keeps track of them, and perhaps keeps some mapping
of items (returned in the batch) to leaf pages. So that when the next
batch is requested, it can do the cleanup, and move to the next batch.

Yes, the current implementation does not do that, to keep the patches
simple. But it should be possible, I believe.

> Basically, it looks like you've invented a shadow batching interface
> that is technically not known to the index AM, but nevertheless
> coordinates with the existing so->currPos batching interface.
>

Perhaps, but which part of that you consider a problem? Are you saying
this shouldn't use the currPos stuff at all, and instead do stuff in
some other way?

>> I don't think indexam.c knows all that much about the nbtree internal
>> batching. It "just" relies on amgetbatch() producing items the AM can
>> handle later (during killtuples/cleanup etc.). It does not even need to
>> be a single-leaf-page batch, if the AM knows how to track/deal with that
>> internally.
>
> I'm concerned that no single place will know about everything under
> this scheme. Having one single place that has visibility into all
> relevant costs, whether they're index AM or table AM related, is what
> I think you should be aiming for.
>
> I think that you should be removing the parts of the nbtree (and other
> index AM) code that deal with the progress of the scan explicitly.
> What remains is code that simply reads the next page, and saves its
> details in the relevant data structures. Or code that "finishes off" a
> leaf page by dropping its pin, and maybe doing the _bt_killitems
> stuff.

Does that mean not having a simple amgetbatch() callback, but some finer
grained interface? Or maybe one callback that returns the next "AM page"
(essentially the currPos), and then another callback to release it?

(This is what I mean by "two-callback API" later.)

Or what would it look like?

> The index AM itself should no longer know about the current next tuple
> to return, nor about mark/restore. It is no longer directly in control
> of the scan's progress. It loses all context that survives across API
> calls.
>

I'm lost. How could the index AM not know about mark/restore?

>> Yes, splitting _bt_steppage() like this makes sense to me, and I agree
>> being able to proceed to the next page before we're done with the
>> current page seems perfectly reasonable for batches spanning multiple
>> leaf pages.
>
> I think that it's entirely possible that it'll just be easier to do
> things this way from the start. I understand that that may be far from
> obvious right now, but, again, I just don't see what's so special
> about the way that each index AM batches results. What about that it
> is so hard to generalize across index AMs that must support amgettuple
> right now? (At least in the case of nbtree and hash, which have no
> special requirements for things like KNN-GiST.)
>

I don't think the batching in various AMs is particularly unique, that's
true. But my goal was to wrap that in a single amgetbatch callback,
because that seemed natural, and that moves some of the responsibilities
to the AM. I still don't quite understand what API you imagine, but if
we want to make more of this the responsibility of indexam.c, I guess it
will require multiple smaller callbacks (I'm not opposed to that, but I
also don't know if that's what you imagine).

> Most individual calls to btgettuple just return the next batched-up
> so->currPos tuple/TID via another call to _bt_next. Things like the
> _bt_first-new-primitive-scan case don't really add any complexity --
> the core concept of processing a page at a time still applies. It
> really is just a simple batching scheme, with a couple of extra fiddly
> details attached to it -- but nothing too hairy.
>

True, although the details (how the batches are represented etc.) are
often quite different, so did you imagine some shared structure to
represent that, or wrapping that in a new callback? Or how would
indexam.c work with that?

> The hardest part will probably be rigorously describing the rules for
> not breaking index-only scans due to concurrent TID recycling by
> VACUUM, and the rules for doing _bt_killitems. But that's also not a
> huge problem, in the grand scheme of things.
>

It probably is not a huge problem ... for someone who's already familiar
with the rules, at least intuitively. But TBH this part really scares me
a little bit.

>> Hmm. I've intentionally tried to ignore these issues, or rather to limit
>> the scope of the patch so that v1 does not require dealing with it.
>> Hence the restriction to single-leaf batches, for example.
>>
>> But I guess I may have to look at this after all ... not great.
>
> To be clear, I don't think that you necessarily have to apply these
> capabilities in v1 of this project. I would be satisfied if the patch
> could just break things out in the right way, so that some later patch
> could improve things later on. I only really want to see the
> capabilities within the index AM decomposed, such that one central
> place can see a global view of the costs and benefits of the index
> scan.
>

Yes, I understand that. Getting the overall design right is my main
concern, even if some of the advanced stuff is not implemented until
later. But with the wrong design, that may turn out to be difficult.

That's the feedback I was hoping for when I kept bugging you, and this
discussion was already very useful in this regard. Thank you for that.

> You should be able to validate the new API by stress-testing the code.
> You can make the index AM read several leaf pages at a time when a
> certain debug mode is enabled. Once you prove that the index AM
> correctly performs the same processing as today correctly, without any
> needless restrictions on the ordering that these decomposed operators
> perform (only required restrictions that are well explained and
> formalized), then things should be on the right path.
>

Yeah, stress testing is my primary tool ...

>>> ISTM that the central feature of the new API should be the ability to
>>> reorder certain kinds of work. There will have to be certain
>>> constraints, of course. Sometimes these will principally be problems
>>> for the table AM (e.g., we musn't allow concurrent TID recycling
>>> unless it's for a plain index scan using an MVCC snapshot), other
>>> times they're principally problems for the index AM (e.g., the
>>> _bt_killitems safety issues).
>>>
>>
>> Not sure. By "new API" you mean the read stream API, or the index AM API
>> to allow batching?
>
> Right now those two concepts seem incredibly blurred to me.
>

Same here.

>>> I get that you're not that excited about multi-page batches; it's not
>>> the priority. Fair enough. I just think that the API needs to work in
>>> terms of batches that are sized as one or more pages, in order for it
>>> to make sense.
>>>
>>
>> True, but isn't that already the case? I mean, what exactly prevents an
>> index AM to "build" a batch for multiple leaf pages? The current patch
>> does not implement that for any of the AMs, true, but isn't that already
>> possible if the AM chooses to?
>
> That's unclear, but overall I'd say no.
>
> The index AM API says that they need to hold on to a buffer pin to
> avoid confusing scans due to concurrent TID recycling by VACUUM. The
> index AM API fails to adequately describe what is expected here. And
> it provides no useful context for larger batching of index pages.
> nbtree already does its own thing by dropping leaf page pins
> selectively.
>

Not sure I understand. I imagined the index AM would just read a
sequence of leaf pages, keeping all the same pins etc. just like it does
for the one leaf it reads right now (pins, etc.).

I'm probably too dumb for that, but I still don't quite understand how
that's different from just reading and processing that sequence of leaf
pages by amgettuple without batching.

> Whether or not it's technically possible is a matter of interpretation
> (I came down on the "no" side, but it's still ambiguous). I would
> prefer it if the index AM API was much simpler for ordered scans. As I
> said already, something along the lines of "when you're told to scan
> the next index page, here's how we'll call you, here's the data
> structure that you need to fill up". Or "when we tell you that we're
> done fetching tuples from a recently read index page, here's how we'll
> call you".
>

I think this is pretty much "two-callback API" I mentioned earlier.

> These discussions about where the exact boundaries lie don't seem very
> helpful. The simple fact is that nobody is ever going to invent an
> index AM side interface that batches up more than a single leaf page.
> Why would they? It just doesn't make sense to, since the index AM has
> no idea about certain clearly-relevant context. For example, it has no
> idea whether or not there's a LIMIT involved.
>
> The value that comes from using larger batches on the index AM side
> comes from making life easier for heap prefetching, which index AMs
> know nothing about whatsoever. Again, the goal should be to marry
> information from the index AM and the table AM in one central place.
>

True, although the necessary context could be passed to the index AM in
some way. That's what happens in the current patch, where indexam.c
could size the batch just right for a LIMIT clause, before asking the
index AM to fill it with items.

>> Unfortunately that didn't work because of killtuples etc. because the
>> index AM had no idea about the indexam queue and has it's own concept of
>> "current item", so it was confused about which item to mark as killed.
>> And that old item might even be from an earlier leaf page (not the
>> "current" currPos).
>
> Currently, during a call to btgettuple, so->currPos.itemIndex is
> updated within _bt_next. But before _bt_next is called,
> so->currPos.itemIndex indicates the item returned by the most recent
> prior call to btgettuple -- which is also the tuple that the
> scan->kill_prior_tuple reports on. In short, btgettuple does some
> trivial things to remember which entries from so->currPos ought to be
> marked dead later on due to the scan->kill_prior_tuple flag having
> been set for those entries. This can be moved outside of each index
> AM.
>
> The index AM shouldn't need to use a scan->kill_prior_tuple style flag
> under the new batching API at all, though. It should work at a higher
> level than that. The index AM should be called through a callback that
> tells it to drop the pin on a page that the table AM has been reading
> from, and maybe perform _bt_killitems on these relevant known-dead
> TIDs first. In short, all of the bookkeeping for so->killedItems[]
> should be happening at a completely different layer. And the
> so->killedItems[] structure should be directly associated with a
> single index page subset of a batch (a subset similar to the current
> so->currPos batches).
>
> The first time the index AM sees anything about dead TIDs, it should
> see a whole leaf page worth of them.
>

I need to think about this a bit, but I agree passing this information
to an index AM through the kill_prior_tuple seems weird.

>> I was thinking maybe the AM could keep the leaf pages, and then free
>> them once they're no longer needed. But it wasn't clear to me how to
>> exchange this information between indexam.c and the index AM, because
>> right now the AM only knows about a single (current) position.
>
> I'm imagining a world in which the index AM doesn't even know about
> the current position. Basically, it has no real context about the
> progress of the scan to maintain at all. It merely does what it is
> told by some higher level, that is sensitive to the requirements of
> both the index AM and the table AM.
>

Hmmm, OK. If the idea is to just return a leaf page as an array of items
(in some fancy way) to indexam.c, then it'd be indexam.c responsible for
tracking what the current position (or multiple positions are), I guess.

>> But imagine we have this:
>>
>> a) A way to switch the scan into "batch" mode, where the AM keeps the
>> leaf page (and a way for the AM to indicate it supports this).
>
> I don't think that there needs to be a batch mode. There could simply
> be the total absence of batching, which is one point along a
> continuum, rather than a discrete mode.
>
>> b) Some way to track two "positions" in the scan - one for read, one for
>> prefetch. I'm not sure if this would be internal in each index AM, or at
>> the indexam.c level.
>
> I think that it would be at the indexam.c level.
>

Yes, if the index AM returns page as a set of items, then it'd be up to
indexam.c to maintain all this information.

>> c) A way to get the index tuple for either of the two positions (and
>> advance the position). It might be a flag for amgettuple(), or maybe
>> even a callaback for the "prefetch" position.
>
> Why does the index AM need to know anything about the fact that the
> next tuple has been requested? Why can't it just be 100% ignorant of
> all that? (Perhaps barring a few special cases, such as KNN-GiST
> scans, which continue to use the legacy amgettuple interface.)
>

Well, I was thinking about how it works now, for the "current" position.
And I was thinking about how would it need to change to handle the
prefetch position too, in the same way ...

But if you're suggesting to move this logic and context to the upper
layer indexam.c, that changes things ofc.

>> d) A way to inform the AM items up to some position are no longer
>> needed, and thus the leaf pages can be cleaned up and freed. AFAICS it
>> could always be "up to the current read position".
>
> Yeah, I like this idea. But the index AM doesn't need to know about
> positions and whatnot. It just needs to do what it's told: to drop the
> pin, and maybe to perform _bt_killitems first. Or maybe just to drop
> the pin, with instruction to do _bt_killitems coming some time later
> (the index AM will need to be a bit more careful within its
> _bt_killitems step when this happens).
>

Well, if the AM works with "batches of tuples for a leaf page" (through
the two callbacks to read / release a page), then positions to exact
items are no longer needed. It just needs to know which pages are still
needed, etc. Correct?

> The index AM doesn't need to drop the current pin for the current
> position -- not as such. The index AM doesn't directly know about what
> pins are held, since that'll all be tracked elsewhere. Again, the
> index AM should need to hold onto zero context, beyond the immediate
> request to perform one additional unit of work, which will
> usually/always happen at the index page level (all of which is tracked
> by data structures that are under the control of the new indexam.c
> level).
>

No idea.

> I don't think that it'll ultimately be all that hard to schedule when
> and how index pages are read from outside of the index AM in question.
> In general all relevant index AMs already work in much the same way
> here. Maybe we can ultimately invent a way for the index AM to
> influence that scheduling, but that might never be required.
>

I haven't thought about scheduling at all. Maybe there's something we
could improve in the future, but I don't see what would it look like,
and it seems unrelated to this patch.

>> Does that sound reasonable / better than the current approach, or have I
>> finally reached the "raving lunatic" stage?
>
> The stage after "raving lunatic" is enlightenment. :-)
>

That's my hope.

regards

--
Tomas Vondra

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2024-11-07 21:36:38 Re: Use __attribute__((target(sse4.2))) for SSE42 CRC32C
Previous Message Tom Lane 2024-11-07 21:32:29 Re: magical eref alias names