From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-04-22 16:26:01
Message-ID: CAH2-Wz=Q6Q9A9owzEGFmqChqT2=4pfKxe-q_-fz8sDbVywzjWQ@mail.gmail.com
Lists: pgsql-hackers
On Tue, Apr 22, 2025 at 6:46 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> here's an improved (rebased + updated) version of the patch series, with
> some significant fixes and changes. The patch adds infrastructure and
> modifies btree indexes to do prefetching - and AFAIK it passes all tests
> (no crashes, correct results).
Cool!
> Compared to the last patch version [1] shared on list (in November),
> there are a number of significant design changes - a lot of this is based
> on a number of off-list discussions I had with Peter Geoghegan, which
> were very helpful.
Thanks for being so receptive to my feedback. I know that I wasn't
particularly clear. I mostly only gave you my hand-wavy, caveat-laden
ideas about how best to layer things. But you were willing to give
them full and fair consideration.
> 1) patch now relies on read_stream
> So I squashed these two parts, and the patch now does read_stream (for
> the table reads) from the beginning.
Makes sense.
> Based on the discussions with Peter I decided to make this a bit more
> ambitious, moving the whole batch management from the index AM to the
> indexam.c level. So now there are two callbacks - amgetbatch and
> amfreebatch, and it's up to indexam.c to manage the batches - decide how
> many batches to allow, etc. The index AM is responsible merely for
> loading the next batch, but does not decide when to load or free a
> batch, how many to keep in memory, etc.
>
> There's a section in indexam.c with a more detailed description of the
> design, I'm not going to explain all the design details here.
To me, the really important point about this high-level design is that
it provides a great deal of flexibility around reordering work, while
still preserving the appearance of an index scan that performs work in
the same old fixed order. All relevant kinds of work (whether table AM
or index AM related) are under the direct control of a single
module. There's one central place for a mechanism that weighs both
costs and benefits, keeping things in balance.
(I realize that there's still some sense in which that isn't true,
partly due to the read stream interface, but for now the important
thing is that we're agreed on this high level direction.)
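For anyone reading along who hasn't opened the patch: the layering
being described might be sketched roughly like this. The callback names
come from Tomas' description above, but the struct and signatures are
my own hypothetical simplification, not what the patch actually
defines:

/* Hypothetical simplification -- not the patch's actual definitions */
typedef struct IndexScanBatch
{
    int         nitems;         /* number of TIDs in this batch */
    ItemPointerData *items;     /* heap TIDs, in index order */
    bool       *killedItems;    /* marked dead by the scan manager */
    int         numKilled;      /* number of entries marked dead */
    void       *opaque;         /* AM-private state (e.g. leaf page) */
} IndexScanBatch;

/*
 * The index AM only knows how to produce and release a batch.  All
 * scheduling -- when to load the next batch, how many batches to keep
 * in memory, when to free one -- is decided centrally, in indexam.c.
 */
IndexScanBatch *amgetbatch(IndexScanDesc scan, ScanDirection dir);
void amfreebatch(IndexScanDesc scan, IndexScanBatch *batch);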
> I think the indexam.c is a sensible layer for this. I was hoping doing
> this at the "executor level" would mean no need for AM code changes, but
> that turned out not to be possible - the AM clearly needs to know about the
> batch boundaries, so that it can e.g. do killtuples, etc. That's why we
> need the two callbacks (not just the "amgetbatch" one). At least this
> way it's "hidden" by the indexam.c API, like index_getnext_slot().
Right. But (if I'm not mistaken) the index AM doesn't actually need to
know *when* to do killtuples. It still needs to have some handling for
this, since we're actually modifying index pages, and we need to have
handling for certain special cases (e.g., posting list tuples) on the
scan side. But it can be made to work in a way that isn't rigidly tied
to the progress of the scan -- it's perfectly fine to do this work
somewhat out of order, if that happens to make sense. It doesn't have
to happen in perfect lockstep with the scan, right after the items
from the relevant leaf page have all been returned.
It should also eventually be possible to do things like perform
killtuples in a different process (perhaps even thread?) to the one
that originally read the corresponding leaf page items. That's the
kind of long term goal to keep in mind, I feel.
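To make that concrete, here is one hypothetical shape for decoupled
killtuples (my sketch, not the patch's code): the scan manager marks
dead items on the batch as it learns of them, and the AM applies the
accumulated kills whenever the batch is eventually freed -- which no
longer has to be immediately after the last item is returned:

/* scan manager side: record a dead item whenever we learn of one */
static void
batch_mark_killed(IndexScanBatch *batch, int itemno)
{
    batch->killedItems[itemno] = true;
    batch->numKilled++;
}

/*
 * AM side: nbtree's amfreebatch could apply the kills lazily.  This can
 * run long after the scan has moved past the leaf page -- or, one day,
 * in another process -- provided the AM re-validates the page first,
 * much as _bt_killitems already does today with its LSN check.
 */
static void
btfreebatch(IndexScanDesc scan, IndexScanBatch *batch)
{
    if (batch->numKilled > 0)
        _bt_killitems_for_batch(scan, batch);   /* hypothetical helper */
    pfree(batch->items);
    pfree(batch->killedItems);
    pfree(batch);
}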
> (You could argue indexam.c is "executor" and maybe it is - I don't know
> where exactly to draw the line. I don't think it matters, really. The
> "hidden in indexam API" is the important bit.)
The term that I've used is "index scan manager", since it subsumes
some of the responsibilities related to scheduling work that has
traditionally been under the control of index AMs. I'm not attached to
that name, but we should agree upon some name for this new concept. It
is a new layer, above the index AM but below the executor proper, and
so it feels like it needs to be clearly distinguished from the two
adjoining layers.
> Or maybe we should
> even rip out the amgettuple() entirely, and only support one of those
> for each AM? That's what Peter suggested, but I'm not convinced we
> should do that.
Just to be clear, for other people reading along: I never said that we
should fully remove amgettuple as an interface. What I said was that I
think that we should remove btgettuple(), and any other amgettuple
routine within index AMs that switch over to using the new interface.
I'm not religious about removing amgettuple() from index AMs that also
support the new batch interface. It's probably useful to keep around
for now, for debugging purposes. My point was only this: I know of no
good reason to keep around btgettuple in the first committed version
of the patch. So if you're going to keep it around, you should surely
have at least one explicit reason for doing so. I don't remember
hearing such a reason?
Even if there is such a reason, maybe there doesn't have to be. Maybe
this reason can be eliminated by improving the batch design such that
we no longer need btgettuple at all (not even for catalogs). Or maybe
it won't be so easy -- maybe we'll have to keep around btgettuple
after all. Either way, I'd like to know the details.
> For now it was very useful to be able to flip between the APIs by
> setting a GUC, and I left prefetching disabled in some places (e.g. when
> accessing catalogs, ...) that are unlikely to benefit. But more
> importantly, I'm not 100% sure we want to require the index AMs to support
> prefetching for all cases - if we do, a single "can't prefetch" case
> would mean we can't prefetch anything for that AM.
I don't see why prefetching should be mandatory with this new
interface. Surely it has to have adaptive "ramp-up" behavior already,
even when we're pretty sure that prefetching is a good idea from the
start?
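For what it's worth, read_stream.c already has roughly the ramp-up
shape I have in mind: the lookahead distance starts small, grows on
cache misses, and decays on hits, so a fully cached scan converges back
toward no readahead. From memory (simplified, not a quote of the real
code):

if (io_was_needed)
{
    /* miss: assume more misses follow; ramp the distance up fast */
    distance = Min(distance * 2, max_distance);
}
else if (distance > 1)
{
    /* hit: decay toward no readahead for fully cached scans */
    distance--;
}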
> In particular, I'm thinking about GiST / SP-GiST and indexes ordered by
> distance, which don't return items in leaf-page order but sort them through a
> binary heap. Maybe we can do prefetch for that, but if we can't it would
> be silly if it meant we can't do prefetch for any other SP-GiST queries.
Again, I would be absolutely fine with continuing to support the
amgettuple interface indefinitely. Again, my only concern is with
index AMs that support both the old and new interfaces at the same
time.
> Anyway, the current patch only implements prefetch for btree. I expect
> it won't be difficult to do this for other index AMs, considering how
> similar the design usually is to btree.
>
> This is one of the next things on my TODO. I want to be able to validate
> the design works for multiple AMs, not just btree.
What's the most logical second index AM to support, after nbtree,
then? Probably hash/hashgettuple?
> I think this is a consequence of read_stream having an internal idea of
> how far ahead to prefetch, based on the number of requests it got so far,
> measured in heap blocks. It has no idea about the context (how that
> maps to index entries, batches we need to keep in memory, ...).
I think that that just makes read_stream an awkward fit for index
prefetching. You legitimately need to see all of the resources that
are in flight. That context will really matter, at least at times.
I'm much less sure what to do about it. Maybe using read_stream is
still the right medium-term design. Further testing/perf validation is
required to be able to say anything sensible about it.
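The narrowness of the interface is visible in the shape of the
callback: all that read_stream ever learns from us is the next heap
block number. A hypothetical sketch of the callback index prefetching
would plug in (the state and helper names are mine, not the patch's):

static BlockNumber
index_scan_stream_next(ReadStream *stream,
                       void *callback_private_data,
                       void *per_buffer_data)
{
    IndexPrefetchState *state = callback_private_data;  /* hypothetical */
    ItemPointer tid = index_prefetch_next_tid(state);   /* hypothetical */

    if (tid == NULL)
        return InvalidBlockNumber;  /* no more TIDs queued */

    /*
     * read_stream sees only this block number.  Batch boundaries and
     * index context are invisible to it, which is exactly the problem
     * described above.
     */
    return ItemPointerGetBlockNumber(tid);
}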
> But there are also cases where it doesn't (and can't) help very much.
> For example fully-cached data, or index-only scans of all-visible
> tables. I've done basic benchmarking based on that (I'll share some
> results in the coming days), and in various cases I see a consistent
> regression in the 10-20% range. The queries are very short (~1ms) and
> there's a fair amount of noise, but it seems fairly consistent.
I'd like to know more about these cases. I'll wait for your benchmark
results, which presumably have examples of this.
--
Peter Geoghegan