From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Robert Haas <robertmhaas(at)gmail(dot)com>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-04-22 18:34:35
Message-ID: e0200bfa-1e60-43be-a5ac-094c8b0fe40e@vondra.me
Lists: pgsql-hackers
On 4/22/25 18:26, Peter Geoghegan wrote:
> On Tue, Apr 22, 2025 at 6:46 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> here's an improved (rebased + updated) version of the patch series, with
>> some significant fixes and changes. The patch adds infrastructure and
>> modifies btree indexes to do prefetching - and AFAIK it passes all tests
>> (no crashes, correct results).
>
> Cool!
>
>> Compared to the last patch version [1] shared on list (in November),
>> there's a number of significant design changes - a lot of this is based
>> on a number of off-list discussions I had with Peter Geoghegan, which
>> was very helpful.
>
> Thanks for being so receptive to my feedback. I know that I wasn't
> particularly clear. I mostly only gave you my hand-wavy, caveat-laden
> ideas about how best to layer things. But you were willing to give
> them full and fair consideration.
>
>> 1) patch now relies on read_stream
>
>> So I squashed these two parts, and the patch now does read_stream (for
>> the table reads) from the beginning.
>
> Make sense.
>
>> Based on the discussions with Peter I decided to make this a bit more
>> ambitious, moving the whole batch management from the index AM to the
>> indexam.c level. So now there are two callbacks - amgetbatch and
>> amfreebatch, and it's up to indexam.c to manage the batches - decide how
>> many batches to allow, etc. The index AM is responsible merely for
>> loading the next batch, but does not decide when to load or free a
>> batch, how many to keep in memory, etc.
>>
>> There's a section in indexam.c with a more detailed description of the
>> design, I'm not going to explain all the design details here.
>
> To me, the really important point about this high-level design is that
> it provides a great deal of flexibility around reordering work, while
> still preserving the appearance of an index scan that performs work in
> the same old fixed order. All relevant kinds of work (whether table AM
> and index AM related work) are under the direct control of one single
> module. There's one central place for a mechanism that weighs both
> costs and benefits, keeping things in balance.
>
> (I realize that there's still some sense in which that isn't true,
> partly due to the read stream interface, but for now the important
> thing is that we're agreed on this high level direction.)
>
Yeah, that makes sense, although I've been thinking about this a bit
differently. I haven't been trying to establish a new "component" to
manage prefetching. For me the question was what's the right layer, so
that unnecessary details don't leak into the AM and/or the executor.

The AM could issue fadvise prefetches, or perhaps even feed blocks into
a read_stream, but it doesn't seem like the right place to make more
decisions. OTOH we don't want every place in the executor to reimplement
the prefetching, and indexam.c seems like a good place in between.

It requires exchanging some additional details with the AM, provided by
the new callbacks.

It seems indexam.c achieves both your goals and mine, more or less.
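
To illustrate the division of labor (a simplified sketch only -- the
names, fields and signatures here are not necessarily what the patch
actually uses):

	/*
	 * Simplified sketch of the batch interface, for illustration only.
	 * The AM fills one batch (roughly one leaf page worth of items) per
	 * amgetbatch call; indexam.c decides how many batches to keep around
	 * and hands each one back via amfreebatch once it's done with it.
	 */
	typedef struct IndexScanBatch
	{
		int			nitems;			/* number of items in this batch */
		ItemPointerData *heaptids;	/* heap TIDs, in index order */
		IndexTuple *itups;			/* index tuples, if requested (IOS) */
		void	   *opaque;			/* AM-private per-batch state */
	} IndexScanBatch;

	/* load the next batch, or return NULL if the scan is exhausted */
	typedef IndexScanBatch *(*amgetbatch_function) (IndexScanDesc scan,
													ScanDirection direction);

	/* hand a consumed batch back to the AM (killtuples, freeing, ...) */
	typedef void (*amfreebatch_function) (IndexScanDesc scan,
										  IndexScanBatch *batch);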
>> I think the indexam.c is a sensible layer for this. I was hoping doing
>> this at the "executor level" would mean no need for AM code changes, but
>> that turned out not possible - the AM clearly needs to know about the
>> batch boundaries, so that it can e.g. do killtuples, etc. That's why we
>> need the two callbacks (not just the "amgetbatch" one). At least this
>> way it's "hidden" by the indexam.c API, like index_getnext_slot().
>
> Right. But (if I'm not mistaken) the index AM doesn't actually need to
> know *when* to do killtuples. It still needs to have some handling for
> this, since we're actually modifying index pages, and we need to have
> handling for certain special cases (e.g., posting list tuples) on the
> scan side. But it can be made to work in a way that isn't rigidly tied
> to the progress of the scan -- it's perfectly fine to do this work
> somewhat out of order, if that happens to make sense. It doesn't have
> to happen in perfect lockstep with the scan, right after the items
> from the relevant leaf page have all been returned.
>
> It should also eventually be possible to do things like perform
> killtuples in a different process (perhaps even thread?) to the one
> that originally read the corresponding leaf page items. That's the
> kind of long term goal to keep in mind, I feel.
>
Right. amfreebatch() does not mean the batch needs to be freed
immediately; it's just handed back to the AM, and it's up to the AM to
do the necessary cleanup at some point. It might queue it for later, or
perhaps even do that in a separate thread ...
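
For example (purely hypothetical, not what the current patch does),
btfreebatch could simply stash the batch and apply the kill bits for
several batches at once, whenever that happens to be convenient:

	/*
	 * Hypothetical sketch: amfreebatch need not do the cleanup right away.
	 * Here the AM just remembers the batch and defers the actual killtuples
	 * work; "pendingBatches" and the flush helper are made-up names.
	 */
	static void
	btfreebatch(IndexScanDesc scan, IndexScanBatch *batch)
	{
		BTScanOpaque so = (BTScanOpaque) scan->opaque;

		so->pendingBatches = lappend(so->pendingBatches, batch);

		/* flush the deferred work once enough batches have accumulated */
		if (list_length(so->pendingBatches) >= 8)
			_bt_process_pending_batches(scan);	/* hypothetical helper */
	}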
>> (You could argue indexam.c is "executor" and maybe it is - I don't know
>> where exactly to draw the line. I don't think it matters, really. The
>> "hidden in indexam API" is the important bit.)
>
> The term that I've used is "index scan manager", since it subsumes
> some of the responsibilities related to scheduling work that has
> traditionally been under the control of index AMs. I'm not attached to
> that name, but we should agree upon some name for this new concept. It
> is a new layer, above the index AM but below the executor proper, and
> so it feels like it needs to be clearly distinguished from the two
> adjoining layers.
>
Yes. I wonder if we should introduce a separate abstraction for this, as
a subset of indexam.c.
>> Or maybe we should
>> even rip out the amgettuple() entirely, and only support one of those
>> for each AM? That's what Peter suggested, but I'm not convinced we
>> should do that.
>
> Just to be clear, for other people reading along: I never said that we
> should fully remove amgettuple as an interface. What I said was that I
> think that we should remove btgettuple(), and any other amgettuple
> routine within index AMs that switch over to using the new interface.
>
> I'm not religious about removing amgettuple() from index AMs that also
> support the new batch interface. It's probably useful to keep around
> for now, for debugging purposes. My point was only this: I know of no
> good reason to keep around btgettuple in the first committed version
> of the patch. So if you're going to keep it around, you should surely
> have at least one explicit reason for doing so. I don't remember
> hearing such a reason?
>
> Even if there is such a reason, maybe there doesn't have to be. Maybe
> this reason can be eliminated by improving the batch design such that
> we no longer need btgettuple at all (not even for catalogs). Or maybe
> it won't be so easy -- maybe we'll have to keep around btgettuple
> after all. Either way, I'd like to know the details.
>
My argument was (a) the ability to disable prefetching and fall back to
the old code if needed, and (b) handling use cases where prefetching
does not work / is not implemented, even if only temporarily (e.g.
ordered scans in SP-GiST). Maybe (a) is unnecessarily defensive, and (b)
may not be needed. Not sure.
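
Roughly the kind of fallback I mean, sketched below. The patch does have
a GUC to flip between the two APIs, but the GUC name and the batch-path
function here are made up for illustration:

	/*
	 * Hypothetical sketch of the fallback in indexam.c: use the batch
	 * interface when the AM provides it and prefetching is enabled,
	 * otherwise fall back to the plain amgettuple path.
	 */
	static ItemPointer
	index_scan_next_tid(IndexScanDesc scan, ScanDirection direction)
	{
		if (enable_indexscan_prefetch &&	/* made-up GUC name */
			scan->indexRelation->rd_indam->amgetbatch != NULL)
			return index_batch_getnext_tid(scan, direction);	/* batch path */

		return index_getnext_tid(scan, direction);	/* traditional amgettuple path */
	}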
>> For now it was very useful to be able to flip between the APIs by
>> setting a GUC, and I left prefetching disabled in some places (e.g. when
>> accessing catalogs, ...) that are unlikely to benefit. But more
>> importantly, I'm not 100% we want to require the index AMs to support
>> prefetching for all cases - if we do, a single "can't prefetch" case
>> would mean we can't prefetch anything for that AM.
>
> I don't see why prefetching should be mandatory with this new
> interface. Surely it has to have adaptive "ramp-up" behavior already,
> even when we're pretty sure that prefetching is a good idea from the
> start?
>
Possibly, I may be too defensive. And perhaps in cases where we know
prefetching can't help we could disable it for the read_stream.
>> In particular, I'm thinking about GiST / SP-GiST and indexes ordered by
>> distance, which don't return items in leaf pages but sort them through a
>> binary heap. Maybe we can do prefetch for that, but if we can't it would
>> be silly if it meant we can't do prefetch for any other SP-GiST queries.
>
> Again, I would be absolutely fine with continuing to support the
> amgettuple interface indefinitely. Again, my only concern is with
> index AMs that support both the old and new interfaces at the same
> time.
>
Understood.
>> Anyway, the current patch only implements prefetch for btree. I expect
>> it won't be difficult to do this for other index AMs, considering how
>> similar the design usually is to btree.
>>
>> This is one of the next things on my TODO. I want to be able to validate
>> the design works for multiple AMs, not just btree.
>
> What's the most logical second index AM to support, after nbtree,
> then? Probably hash/hashgettuple?
>
I think hash should be fairly easy to support. But I was really thinking
about doing SP-GiST, exactly because it's very different in some
aspects, and I wanted to validate the design on that (for hash I think
it's almost certain it's OK).
>> I think this is a consequence of read_stream having an internal idea how
>> far ahead to prefetch, based on the number of requests it got so far,
>> measured in heap blocks. It has not idea about the context (how that
>> maps to index entries, batches we need to keep in memory, ...).
>
> I think that that just makes read_stream an awkward fit for index
> prefetching. You legitimately need to see all of the resources that
> are in flight. That context will really matter, at least at times.
>
> I'm much less sure what to do about it. Maybe using read_stream is
> still the right medium-term design. Further testing/perf validation is
> required to be able to say anything sensible about it.
>
Agreed. That's why I suggested it might help if the read_stream had the
ability to pause/resume in some way, based on what the read_next
callback decides, without having to stall for a while (which is what the
read_stream_reset workaround does).
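
For reference, a sketch of what I mean. The block-number callback
signature and read_stream_reset() match the existing read_stream API;
the pause flag, the state struct and the helper are made up:

	/*
	 * Sketch of the "pause" workaround.  Returning InvalidBlockNumber makes
	 * the stream stop looking ahead; to resume, the caller later clears
	 * state->paused and calls read_stream_reset(stream) -- at which point
	 * the look-ahead distance ramps up from scratch again, which is the
	 * stall I'd like to avoid.
	 */
	static BlockNumber
	index_prefetch_next_block(ReadStream *stream, void *callback_private_data,
							  void *per_buffer_data)
	{
		IndexPrefetchState *state = callback_private_data;	/* made-up struct */

		/* don't look ahead past the batches we're allowed to keep in memory */
		if (state->paused)
			return InvalidBlockNumber;

		return next_block_from_current_batch(state);	/* made-up helper */
	}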
>> But there are also cases where it doesn't (and can't) help very much.
>> For example fully-cached data, or index-only scans of all-visible
>> tables. I've done basic benchmarking based on that (I'll share some
>> results in the coming days), and in various cases I see a consistent
>> regression in the 10-20% range. The queries are very short (~1ms) and
>> there's a fair amount of noise, but it seems fairly consistent.
>
> I'd like to know more about these cases. I'll wait for your benchmark
> results, which presumably have examples of this.
>
I expect to have better data sometime next week.
I think the cases affected by this the most are index-only scans on
all-visible tables that fit into shared buffers, with a
correlated/sequential access pattern. Or even regular index scans with
all data in shared buffers.

It also seems quite hardware / CPU dependent - I see a much worse impact
on an older Xeon than on a new Ryzen.
regards
--
Tomas Vondra