Re: index prefetching

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2024-11-10 22:40:43
Message-ID: CAH2-Wz=G_HKYZ34-UkBMQjk7SCNtu9O5V4t=A0u82=WT-rtBuw@mail.gmail.com
Lists: pgsql-hackers

On Sun, Nov 10, 2024 at 4:41 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> Is it a good idea to make this part (in indexam.c) aware of /
> responsible for managing stuff like pins?

My sense is that that's the right long-term architectural direction,
though I can't really prove it.

> Perhaps it'd work fine for
> index AMs that always return an array of items for a single leaf-page
> (like btree or hash). But I'm still thinking about cases like gist with
> ORDER BY clauses, or maybe something even weirder in custom AMs.

Nothing is perfect. What you really have to worry about not supporting
is the subset of index AMs that implement amgettuple but aren't quite a
natural fit for this. At least for in-core index AMs that's really
just GiST (iff KNN-GiST is in use, which it usually isn't) plus
SP-GiST.

AFAIK most out-of-core index AMs only support lossy index scans in
practice. Just limiting yourself to that makes an awful lot of things
easier. For example, I think that GIN gets away with a lot by only
supporting lossy scans -- there's a comment above ginInsertCleanup()
that says "On first glance it looks completely not crash-safe", but
stuff like that is automatically okay with lossy scans. So many index
AMs automatically don't need to be considered here at all.

> It seems to me knowing which pages may be pinned is very AM-specific
> knowledge, and my intention was to let the AM to manage that.

This is useful information, because it helps me to understand how
you're viewing this.

I totally disagree with this characterization. This is an important
difference in perspective. IMV index AMs hardly care at all about
holding onto buffer pins, very much unlike heapam.

I think that holding onto pins and whatnot has almost nothing to do
with the index AM as such -- it's about protecting against unsafe
concurrent TID recycling, which is a table AM/heap issue. You can make
a rather weak argument that the index AM needs it for _bt_killitems,
but that seems very secondary to me (if you go back far enough there
was no _bt_killitems, but the pin thing itself still existed).

As I pointed out before, the index AM API docs (at
https://www.postgresql.org/docs/devel/index-locking.html) talk about
holding onto buffer pins on leaf pages during amgettuple -- and the
reason given there is to keep VACUUM from recycling the referenced heap
TIDs while the scan still needs them. So the need to mess around with
pins just doesn't come from the index AM side at all. The cleanup lock
interlock against TID recycling protects the scan from seeing transient
wrong answers -- it doesn't protect the index structure itself.

The only thing that's a bit novel about what I'm proposing now is that
I'm imagining that it'll eventually be possible to usefully schedule
multi-leaf-page batches using code that has no more than a very general
notion of how an ordered index scan works. That might turn out to be
more complicated than I currently suppose. If it is, it should still be
fixable.

> That is,
> the new indexam code would be responsible for deciding when the "AM
> batches" are loaded and released, using the two new callbacks. But it'd
> be the AM responsible for making sure everything is released.

What does it really mean for the index AM to be responsible for a
thing? I think that the ReleaseBuffer() calls would be happening in
index AM code, for sure. But that would probably always be called
through your new index scan management code in practice.

I don't have any fixed ideas about the resource management aspects of
this. That doesn't seem particularly fundamental to the design.
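
Just to make sure we're imagining the same division of labor, here's a
rough sketch of what I mean. Every name and signature below is a
hypothetical illustration -- it isn't your patch, and it isn't existing
code:

/*
 * Hypothetical sketch only.  The index AM fills in a batch (TIDs plus
 * whatever leaf-page pins it holds); the indexam.c layer decides when a
 * batch is loaded and when it is released, but the ReleaseBuffer() calls
 * themselves still happen in AM code.
 */
#include "postgres.h"
#include "access/relscan.h"
#include "storage/bufmgr.h"
#include "storage/itemptr.h"

typedef struct IndexBatch
{
    ItemPointerData *items;     /* heap TIDs, in index order */
    int         nitems;
    Buffer     *buffers;        /* pins held on the leaf page(s) read */
    int         nbuffers;
} IndexBatch;

/* hypothetical callback pair, only ever called through indexam.c */
typedef bool (*amgetbatch_fn) (struct IndexScanDescData *scan, IndexBatch *batch);
typedef void (*amfreebatch_fn) (struct IndexScanDescData *scan, IndexBatch *batch);

/* what an AM-side "free batch" callback might amount to */
static void
hypothetical_amfreebatch(struct IndexScanDescData *scan, IndexBatch *batch)
{
    for (int i = 0; i < batch->nbuffers; i++)
        ReleaseBuffer(batch->buffers[i]);   /* pins dropped here, in AM code... */
    batch->nbuffers = 0;
}

/*
 * ...but only because the scan management code in indexam.c decided that
 * the batch is finished with, and called through amfreebatch_fn.
 */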

> I agree that in the simple cases it's not difficult to determine what
> pins we need for the sequence of tuples/pages. But is it guaranteed to
> be that easy, and is it easy to communicate this information to the
> indexam.c layer?

I think that it's fairly generic. The amount of work required to read
an index page is (in very round numbers) more or less uniform across
index AMs. Maybe you'd need some way of measuring how many pages you
had to read without returning any tuples, for scheduling purposes --
that's a relevant cost, and so would probably have to be tracked. But
that still seems fairly general -- any kind of ordered index scan is
liable to sometimes scan multiple pages without having any index tuples
to return.
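
As a concrete (and entirely hypothetical) illustration of the kind of
bookkeeping I have in mind -- none of these names exist anywhere:

/*
 * Hypothetical per-scan counters.  The point is only that "leaf pages
 * read that returned nothing" is a cost a batch scheduler would probably
 * need to see.
 */
#include <stdint.h>

typedef struct BatchScanStats
{
    uint64_t    leaf_pages_read;    /* total leaf pages read by the scan */
    uint64_t    leaf_pages_empty;   /* of those, pages returning no items */
} BatchScanStats;

/* e.g. read-ahead could back off when this fraction gets high */
static inline double
empty_leaf_fraction(const BatchScanStats *stats)
{
    if (stats->leaf_pages_read == 0)
        return 0.0;
    return (double) stats->leaf_pages_empty / (double) stats->leaf_pages_read;
}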

> Sure, maybe it'd need some more information - say, how many items we
> expect to read, but if indexam knows that bit, surely it can pass it
> down to the AM.

What are you arguing for here? Practically speaking, I think that the
best way to do it is to have one layer that manages all this stuff. It
would also be possible to split it up any way you can think of, but
why would you want to?

I'm not asking you to solve these problems. I'm only suggesting that
you move things in a direction that is amenable to adding these things
later on.

> By generalizing you mean defining a common struct serving the same
> purpose, but for all the index AMs? And the new AM callbacks would
> produce/consume this new struct, right?

Yes.

> I don't think I suggested the index AM would need to know about the
> batch size. Only indexam.c would be aware of that, and would read enough
> stuff from the index to satisfy that.

I don't think that you'd ultimately want to make the batch sizes fixed
(though they'd probably always consist of tuples taken from 1 or more
index pages). Ultimately the size would vary over time, based on
competing considerations.

> > The nbtree code will know about buffer pins held, in the sense that
> > it'll be the one setting the Buffer variables in the new scan
> > descriptor thing. But it's not going to remember to drop those buffer
> > pins on its own. It'll need to be told. So it's not ever really in
> > control.
> >
> Right. So those pins would be released after indexam invokes the second
> new callback, instructing the index AM to release everything associated
> with a chunk of items returned sometime earlier.

Yes. It might all look very similar to today, at least for your
initial committed version.

You might also want to combine reading the next page with dropping the
pin on the previous page. But also maybe not.
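
If you did combine them, the shape would presumably be something like
the following (hypothetical, hand-wavy sketch -- the struct is made up,
and how the next leaf page's block number gets determined is omitted
entirely):

#include "postgres.h"
#include "storage/bufmgr.h"

typedef struct LeafStepState
{
    Relation    indexrel;
    Buffer      curbuf;         /* pin currently held, if any */
    BlockNumber nextblk;        /* next leaf page to read */
} LeafStepState;

static bool
step_to_next_leaf(LeafStepState *state)
{
    Buffer      prev = state->curbuf;

    if (state->nextblk == InvalidBlockNumber)
        return false;

    /* pin the next leaf page first... */
    state->curbuf = ReadBuffer(state->indexrel, state->nextblk);

    /* ...and only then let go of the previous one */
    if (BufferIsValid(prev))
        ReleaseBuffer(prev);

    return true;
}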

> OK. The thing that worries me is whether it's going to be this simple
> for other AMs. Maybe it is, I don't know.

Really? I mean, if the scans left out of your new facility are just the
subset of GiST scans that use KNN-GiST, plus SP-GiST scans, that seems
quite acceptable to me.

> I don't recall my reasoning, and I'm not saying it was the right
> instinct. But if we have one callback to read tuples, it seemed like
> maybe we should have one callback to read a bunch of tuples in a similar
> way.

The tuple-level interface will still need to exist, of course. It just
won't be directly owned by affected index AMs.

> I meant that each of the AMs uses a separate typedef, with different
> fields, etc. I'm sure there are similarities (it's always an array of
> elements, either TIDs, index or heap tuples, or some combination of
> that). But maybe there is stuff unique to some AMs - chances are that
> can be either "generalized" or extended using some private member.

Right. Maybe it won't even be that hard to do SP-GiST and KNN-GiST
index scans with this too.

> No opinion, but it's not clear to me how exactly would this work. I've
> imagined we'd just acquire (and release) multiple pins as we go.

More experimentation is required to get good intuitions about how
useful it is to reorder stuff, to make heap prefetching work best.

> Could you briefly outline how you think this might interact with the
> scheduling of index page reads? I can imagine telling someone about
> which future index pages we might need to read (say, the next leaf
> page), or something like that. But this patch is about prefetching the
> heap pages, so it seems like an entirely independent thing.

I agree that prefetching of index pages themselves would be entirely
independent (and probably much less useful). I wasn't talking about
that at all, though. I was talking about the potential value in
reading multiple leaf pages at a time as an enabler of heap
prefetching -- to avoid "pipeline stalls" for heap prefetching, with
certain workloads.

The simplest example of how these two things (heap prefetching and
eager leaf page reading) could be complementary is the idea of
coalescing together accesses to the same heap page from TIDs that
don't quite appear in order (when read from the index), but are
clustered together. Not just clustered together on one leaf page --
clustered together on a few sibling leaf pages. (The exact degree to
which you'd vary how many leaf pages you read at a time might need to
be fully dynamic/adaptive.)

We've talked about this already. Reading multiple index pages at a
time could in general result in pinning/reading the same heap pages
far less often. Imagine that our scan inherently needs to read a
total of no more than 3 or 4 index leaf pages. Reading all of those
leaf pages in one go probably doesn't add any real latency, but it
literally guarantees that no heap page will need to be accessed twice.
So it's almost a hybrid of an index scan and bitmap index scan,
offering the best of both worlds.
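
To spell the coalescing part out with a toy example (standalone C,
purely illustrative -- not patch code, and the block numbers are made
up): gather the TIDs from a few sibling leaf pages into one batch, group
them by heap block, and only then start touching the heap:

/*
 * Each heap page gets prefetched/read once, even though its TIDs aren't
 * adjacent in index order.  A real implementation would also have to
 * remember enough about the original index order to return tuples in
 * that order afterwards.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct { uint32_t heap_blk; uint16_t off; } Tid;

static int
cmp_by_block(const void *a, const void *b)
{
    uint32_t    ba = ((const Tid *) a)->heap_blk;
    uint32_t    bb = ((const Tid *) b)->heap_blk;

    return (ba > bb) - (ba < bb);
}

int
main(void)
{
    /* TIDs as they'd come off three sibling leaf pages, in index order */
    Tid         batch[] = {
        {10, 1}, {37, 4}, {10, 7},      /* leaf page 1 */
        {22, 2}, {10, 3}, {37, 9},      /* leaf page 2 */
        {22, 5}, {51, 1}, {10, 8},      /* leaf page 3 */
    };
    int         nitems = sizeof(batch) / sizeof(batch[0]);

    qsort(batch, nitems, sizeof(Tid), cmp_by_block);

    /* one prefetch per distinct heap block, instead of one per TID */
    for (int i = 0; i < nitems; i++)
    {
        if (i == 0 || batch[i].heap_blk != batch[i - 1].heap_blk)
            printf("prefetch heap block %u\n", (unsigned) batch[i].heap_blk);
    }
    return 0;
}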

--
Peter Geoghegan
