Re: index prefetching

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2024-11-10 21:41:33
Message-ID: dc4c7ace-f3c4-4325-a63c-b1f9aa5541ba@vondra.me
Lists: pgsql-hackers

On 11/8/24 02:35, Peter Geoghegan wrote:
> On Thu, Nov 7, 2024 at 4:34 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> Not sure I understand, but I think I'm somewhat confused by "index AM"
>> vs. indexam. Are you suggesting the individual index AMs should know as
>> little about the batching as possible, and instead it should be up to
>> indexam.c to orchestrate most of the stuff?
>
> Yes, that's what I'm saying. Knowing "as little as possible" turns out
> to be pretty close to knowing nothing at all.
>
> There might be some minor exceptions, such as the way that nbtree
> needs to remember the scan's array keys. But that already works in a
> way that's very insensitive to the exact position in the scan. For
> example, right now if you restore a mark that doesn't just come from
> the existing so->currPos batch then we cheat and reset the array keys.
>
>> If yes, then I agree in principle, and I think indexam.c is the right
>> place to do that (or at least I can't think of a better one).
>
> Good.
>
>> That's what the current patch aimed to do, more or less. I'm not saying
>> it got it perfectly right, and I'm sure there is stuff that can be
>> improved (like reworking _steppage to not deal with killed tuples). But
>> surely the index AMs need to have some knowledge about batching, because
>> how else would they know which leaf pages to still keep, etc.?
>
> I think that your new thing can directly track which leaf pages have
> pins. As well as tracking the order that it has to return tuples from
> among those leaf page batch subsets.
>
> Your new thing can think about this in very general terms, that really
> aren't tied to any index AM specifics. It'll have some general notion
> of an ordered sequence of pages (in scan/key space order), each of
> which contains one or more tuples to return. It needs to track which
> pages have tuples that we've already done all the required visibility
> checks for, in order to be able to instruct the index AM to drop the
> pin.
>

Is it a good idea to make this part (in indexam.c) aware of /
responsible for managing stuff like pins? Perhaps it'd work fine for
index AMs that always return an array of items for a single leaf page
(like btree or hash). But I'm still thinking about cases like gist with
ORDER BY clauses, or maybe something even weirder in custom AMs.

It seems to me that knowing which pages may be pinned is very
AM-specific knowledge, and my intention was to let the AM manage that.
That is, the new indexam.c code would be responsible for deciding when
the "AM batches" are loaded and released, using the two new callbacks.
But it'd be the AM that's responsible for making sure everything is
actually released.
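
To be concrete, the shape I have in mind is roughly this (all the
names and signatures are made up, it's just a sketch):

typedef struct IndexScanBatch IndexScanBatch;   /* shared struct, see below */

/*
 * Load the next batch of items in the given direction, possibly
 * spanning multiple leaf pages. Returns NULL once the scan is done.
 * The hint tells the AM how many items indexam.c would like, e.g.
 * when sizing batches for a LIMIT.
 */
typedef IndexScanBatch *(*amgetbatch_function) (IndexScanDesc scan,
                                                ScanDirection direction,
                                                int nitems_hint);

/*
 * Release everything associated with a batch returned earlier --
 * drop pins, process kill tuples, free memory. The AM stays
 * responsible for the actual cleanup, it's just told when to do it.
 */
typedef void (*amfreebatch_function) (IndexScanDesc scan,
                                      IndexScanBatch *batch);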

> Suppose, for example, that we're doing an SAOP index scan, where the
> leaf pages that our multi-page batch consists of aren't direct
> siblings. That literally doesn't matter at all. The pages still have
> to be in the same familiar key space/scan order, regardless. And that
> factor shouldn't really need to influence how many pins we're willing
> to hold on to (no more than it would when there are large numbers of
> index leaf pages with no interesting tuples to return that we must
> still scan over).
>

I agree that in the simple cases it's not difficult to determine what
pins we need for the sequence of tuples/pages. But is it guaranteed to
be that easy, and is it easy to communicate this information to the
indexam.c layer? I'm not sure about that. In an extreme case it may be
that each tuple comes from an entirely different leaf page, and stuff like
that. And while most out-of-core AMs that I'm aware of are rather close
to nbtree/gist/gin, I wonder what weird things can be out there.

>>>> 3) It makes it clear when the items are no longer needed, and the AM can
>>>> do cleanup, process kill tuples, etc.
>>>
>>> But it doesn't, really. The index AM is still subject to exactly the
>>> same constraints in terms of page-at-a-time processing. These existing
>>> constraints always came from the table AM side, so it's not as if your
>>> patch can remain totally neutral on these questions.
>>>
>>
>> Not sure I understand. Which part of my sentence do you disagree with?
>> Or which constraints do you mean?
>
> What I was saying here was something I said more clearly a bit further
> down: it's technically possible to do multi-page batches within the
> confines of the current index AM API, but that's not true in any
> practical sense. And it'll never be true with an API that looks very
> much like the current amgettuple API.
>

OK

>> The interface does not require page-at-a-time processing - the index AM
>> is perfectly within its rights to produce a batch spanning 10 leaf
>> pages, as long as it keeps track of them, and perhaps keeps some mapping
>> of items (returned in the batch) to leaf pages. So that when the next
>> batch is requested, it can do the cleanup, and move to the next batch.
>
> How does an index AM actually do that in a way that's useful? It only
> sees a small part of the picture. That's why it's the wrong place for
> it.
>

Sure, maybe it'd need some more information - say, how many items we
expect to read - but if indexam.c knows that, surely it can pass it
down to the AM (like the nitems_hint in the sketch above).

But yeah, I agree doing it in amgettuple() would be inconvenient and
maybe even awkward. I can imagine the AM maintaining an array of
currPos entries, but then it'd also need to be made aware of multiple
positions and stuff like that, which it shouldn't need to know about.

>>> Basically, it looks like you've invented a shadow batching interface
>>> that is technically not known to the index AM, but nevertheless
>>> coordinates with the existing so->currPos batching interface.
>>>
>>
>> Perhaps, but which part of that do you consider a problem? Are you saying
>> this shouldn't use the currPos stuff at all, and instead do stuff in
>> some other way?
>
> I think that you should generalize the currPos stuff, and move it to
> some other, higher level module.
>

By generalizing you mean defining a common struct serving the same
purpose, but for all the index AMs? And the new AM callbacks would
produce/consume this new struct, right?
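
Say, something like this (fields entirely made up, just to illustrate
what I think a "generalized currPos" might look like):

typedef struct IndexScanBatch
{
    int         firstItem;      /* first valid index in the items arrays */
    int         lastItem;       /* last valid index */
    int         itemIndex;      /* current position within the batch */
    int         streamIndex;    /* read-ahead position, for prefetching */

    /* leaf pages contributing to this batch, in scan order */
    int         npages;
    Buffer     *buffers;        /* pins held (InvalidBuffer once dropped) */

    ItemPointerData *heaptids;  /* TIDs to return, in scan/key space order */
    IndexTuple *itups;          /* index tuples, for index-only scans */

    void       *opaque;         /* AM-private stuff, if needed */
} IndexScanBatch;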

>> Does that mean not having a simple amgetbatch() callback, but some finer
>> grained interface? Or maybe one callback that returns the next "AM page"
>> (essentially the currPos), and then another callback to release it?
>>
>> (This is what I mean by "two-callback API" later.)
>
> I'm not sure. Why does the index AM need to care about the batch size
> at all? It merely needs to read the next leaf page. The high level
> understanding of batches and the leaf pages that constitute batches
> lives elsewhere.
>

I don't think I suggested the index AM would need to know about the
batch size. Only indexam.c would be aware of it, and it would read
enough from the index to satisfy it.

> The nbtree code will know about buffer pins held, in the sense that
> it'll be the one setting the Buffer variables in the new scan
> descriptor thing. But it's not going to remember to drop those buffer
> pins on its own. It'll need to be told. So it's not ever really in
> control.
>

Right. So those pins would be released after indexam.c invokes the
second new callback, instructing the index AM to release everything
associated with a chunk of items returned sometime earlier. Roughly
like in the loop below.
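
A minimal sketch of what I imagine the indexam.c side doing (made-up
helper names, error handling omitted):

/*
 * indexam.c decides how many items it wants (e.g. based on a LIMIT),
 * the AM just loads and releases batches when told to.
 */
for (;;)
{
    if (scan->batch == NULL || batch_exhausted(scan->batch, direction))
    {
        /* tell the AM to drop pins, process kill tuples, ... */
        if (scan->batch != NULL)
            scan->amfreebatch(scan, scan->batch);

        scan->batch = scan->amgetbatch(scan, direction, nitems_hint);
        if (scan->batch == NULL)
            break;              /* no more items in the scan */
    }

    tid = batch_next_tid(scan->batch, direction);

    /* ... hand the TID to the prefetching / heap access machinery ... */
}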

>>> The index AM itself should no longer know about the current next tuple
>>> to return, nor about mark/restore. It is no longer directly in control
>>> of the scan's progress. It loses all context that survives across API
>>> calls.
>>>
>>
>> I'm lost. How could the index AM not know about mark/restore?
>
> Restoring a mark already works by restoring an earlier so->currPos
> batch. Actually, more often it works by storing an offset into the
> current so->currPos, without actually copying anything into
> so->markPos, and without restoring so->markPos into so->currPos.
>
> In short, there is virtually nothing about how mark/restore works that
> really needs to live inside nbtree. It's all just restoring an earlier
> batch and/or offset into a batch. The only minor caveat is the stuff
> about array keys that I went into already -- that isn't quite a piece
> of state that lives in so->currPos, but it's a little bit like that.
>
> You can probably poke one or two more minor holes in some of this --
> it's not 100% trivial. But it's doable.
>

OK. The thing that worries me is whether it's going to be this simple
for other AMs. Maybe it is, I don't know.
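
But if it is this simple, then mark/restore in indexam.c might be
little more than "batch + offset". Something like this, perhaps
(sketch only, assuming the marked batch is retained or can be re-read):

typedef struct IndexScanMark
{
    IndexScanBatch *batch;      /* batch the mark points into */
    int         itemIndex;      /* offset within that batch */
} IndexScanMark;

static void
index_restore_mark(IndexScanDesc scan, IndexScanMark *mark)
{
    /* common case: just reposition within the current batch */
    if (mark->batch == scan->batch)
    {
        scan->batch->itemIndex = mark->itemIndex;
        return;
    }

    /* otherwise switch back to the (retained) marked batch */
    scan->batch = mark->batch;
    scan->batch->itemIndex = mark->itemIndex;

    /* the array-key reset you mention would stay AM-specific */
}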

>> I don't think the batching in various AMs is particularly unique, that's
>> true. But my goal was to wrap that in a single amgetbatch callback,
>> because that seemed natural, and that moves some of the responsibilities
>> to the AM.
>
> Why is it natural? I mean all of the index AMs that support amgettuple
> copied everything from nbtree already. Including all of the
> kill_prior_tuple stuff. It's already quite generic.
>

I don't recall my reasoning, and I'm not saying it was the right
instinct. But if we have one callback to read a single tuple, it
seemed natural to have one callback to read a bunch of tuples in a
similar way.

>> I still don't quite understand what API you imagine, but if
>> we want to make more of this the responsibility of indexam.c, I guess it
>> will require multiple smaller callbacks (I'm not opposed to that, but I
>> also don't know if that's what you imagine).
>
> I think that you understood me correctly here.
>
>>> Most individual calls to btgettuple just return the next batched-up
>>> so->currPos tuple/TID via another call to _bt_next. Things like the
>>> _bt_first-new-primitive-scan case don't really add any complexity --
>>> the core concept of processing a page at a time still applies. It
>>> really is just a simple batching scheme, with a couple of extra fiddly
>>> details attached to it -- but nothing too hairy.
>>>
>>
>> True, although the details (how the batches are represented etc.) are
>> often quite different, so did you imagine some shared structure to
>> represent that, or wrapping that in a new callback?
>
> In what sense are they sometimes different?
>
> In general batches will consist of one or more groups of tuples, each
> of which is associated with a particular leaf page (if the scan
> returns no tuples for a given scanned leaf page then it won't form a
> part of the final batch). You can do amgettuple style scrolling back
> and forth with this structure, across page boundaries. Seems pretty
> general to me.
>

I meant that each of the AMs uses a separate typedef, with different
fields, etc. I'm sure there are similarities (it's always an array of
elements - TIDs, index or heap tuples, or some combination of those).
But maybe there is stuff unique to some AMs - chances are it can either
be "generalized" or handled through some private member (like the
opaque pointer in the struct above).

>> Yes, I understand that. Getting the overall design right is my main
>> concern, even if some of the advanced stuff is not implemented until
>> later. But with the wrong design, that may turn out to be difficult.
>>
>> That's the feedback I was hoping for when I kept bugging you, and this
>> discussion was already very useful in this regard. Thank you for that.
>
> I don't want to insist on doing all this. But it just seems really
> weird to have this shadow batching system for the so->currPos batches.
>
>>> The index AM API says that they need to hold on to a buffer pin to
>>> avoid confusing scans due to concurrent TID recycling by VACUUM. The
>>> index AM API fails to adequately describe what is expected here. And
>>> it provides no useful context for larger batching of index pages.
>>> nbtree already does its own thing by dropping leaf page pins
>>> selectively.
>>>
>>
>> Not sure I understand. I imagined the index AM would just read a
>> sequence of leaf pages, keeping all the same pins etc., just like it
>> does for the one leaf page it reads right now.
>
> Right. But it wouldn't necessarily drop the leaf pages right away. It
> might try to coalesce together multiple heap page accesses, for index
> tuples that happen to span page boundaries (but are part of the same
> higher level batch).
>

No opinion, but it's not clear to me how exactly this would work. I'd
imagined we'd just acquire (and release) multiple pins as we go.

>> I'm probably too dumb for that, but I still don't quite understand how
>> that's different from just reading and processing that sequence of leaf
>> pages by amgettuple without batching.
>
> It's not so much different, as just more flexible. It's possible that
> v1 would effectively do exactly the same thing in practice. It'd only
> be able to do fancier things with holding onto leaf pages in a debug
> build, that validated the general approach.
>
>> True, although the necessary context could be passed to the index AM in
>> some way. That's what happens in the current patch, where indexam.c
>> could size the batch just right for a LIMIT clause, before asking the
>> index AM to fill it with items.
>
> What difference does it make where it happens? It might make some
> difference, but as I keep saying, the important point is that
> *somebody* has to know all of these things at the same time.
>

Agreed.

>>> I don't think that it'll ultimately be all that hard to schedule when
>>> and how index pages are read from outside of the index AM in question.
>>> In general all relevant index AMs already work in much the same way
>>> here. Maybe we can ultimately invent a way for the index AM to
>>> influence that scheduling, but that might never be required.
>>>
>>
>> I haven't thought about scheduling at all. Maybe there's something we
>> could improve in the future, but I don't see what it would look like,
>> and it seems unrelated to this patch.
>
> It's only related to this patch in the sense that we have to imagine
> that it'll be worth having in some form in the future.
>
> It might also be a good exercise architecturally. We don't need to do
> the same thing in several slightly different ways in each index AM.
>

Could you briefly outline how you think this might interact with the
scheduling of index page reads? I can imagine telling someone which
future index pages we might need to read (say, the next leaf page), or
something like that. But this patch is about prefetching the heap
pages, so it seems like an entirely independent thing. And ISTM there
are concurrency challenges with prefetching index pages (at least when
leveraging the read stream API to do async reads).
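
For the heap side, I imagine feeding block numbers from the current
batch into a read stream, roughly like this (the read stream API
itself is real, the batch bookkeeping here is made up):

static BlockNumber
index_batch_next_block(ReadStream *stream,
                       void *callback_private_data,
                       void *per_buffer_data)
{
    IndexScanBatch *batch = (IndexScanBatch *) callback_private_data;

    if (batch->streamIndex > batch->lastItem)
        return InvalidBlockNumber;  /* batch exhausted, for now */

    /* next heap block, in the order the TIDs will be returned */
    return ItemPointerGetBlockNumber(&batch->heaptids[batch->streamIndex++]);
}

...

stream = read_stream_begin_relation(READ_STREAM_DEFAULT, NULL,
                                    heapRelation, MAIN_FORKNUM,
                                    index_batch_next_block, batch, 0);

The index pages are a different story - there the next block to read
depends on the contents of pages we haven't read yet, which is exactly
the concurrency challenge I mean.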

regards

--
Tomas Vondra
