Re: index prefetching

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2024-11-06 17:25:41
Message-ID: 8b58ae08-9ed7-43aa-92b0-1976ea6385ed@vondra.me
Lists: pgsql-hackers

Hi,

Attached is an updated version of this patch series. The first couple of
parts (adding batching + updating built-in index AMs) remain the same;
the new part is 0007, which switches index scans to the read stream API.

We tried to use the read stream API for index prefetching before, and it
didn't seem to be a good fit for that design of the patch. It wasn't
clear what to do about kill tuples, the index AM and the read stream had
no way to communicate, etc.

I speculated that with the batching concept it might work better, and I
think that turned out to be the case. The batching is still the core
idea, giving the index AM enough control to make kill tuples work (by
not generating batches spanning multiple leaf pages, or doing something
smarter). And the read stream leverages that too - the next_block
callback returns items from the current batch, and the stream is reset
between batches. This is the same prefetch restriction as with the
explicit prefetching (done using posix_fadvise), except that the
prefetching is done by the read stream.
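To make the batch/callback interaction concrete, here's a toy,
self-contained sketch in plain C. The types and names (IndexBatch,
batch_next_block) are mine, not the patch's; it only illustrates the
shape of a next_block callback that serves heap blocks from the current
batch and signals exhaustion, so the scan can load the next batch and
reset the stream:

```c
#include <assert.h>

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

/* Toy stand-in for one batch of items from a single leaf page. */
typedef struct IndexBatch
{
	BlockNumber blocks[8];		/* heap blocks referenced by the batch */
	int			nblocks;		/* number of valid entries in blocks[] */
	int			next;			/* position of the next item to hand out */
} IndexBatch;

/*
 * Toy next_block callback: return heap blocks from the current batch
 * only.  Returning InvalidBlockNumber ends the stream; the scan then
 * loads the next batch and resets the stream, so prefetching never
 * looks past a batch (i.e. leaf page) boundary.
 */
static BlockNumber
batch_next_block(IndexBatch *batch)
{
	if (batch->next >= batch->nblocks)
		return InvalidBlockNumber;	/* batch exhausted, caller resets */
	return batch->blocks[batch->next++];
}
```

The point is that the callback's horizon is exactly one batch, which is
what keeps the kill-tuples handling correct.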

As I said before, I think this is an acceptable restriction for v1. It
can be relaxed in the future if needed, allowing cross-leaf batches
and/or multiple in-flight batches. But the patch is complex enough
already, and even this simpler version gives a significant benefit.

The main open questions are about how to structure the patch - in which
order to introduce the changes. Right now the patch adds batching, and
then bases the read stream on that.

The batching part now includes explicit prefetching, which is later
removed by 0007. That's mostly to allow comparing the explicit
prefetching against the read stream, which is an interesting data
point. Ultimately the explicit prefetch should be removed, leaving just
basic batching + read stream.

But in which order should we introduce the parts? Right now the
batching is introduced first, followed by the read stream. But I can
imagine doing it the other way too - introducing the read stream first,
and then the batching.

Of course, without batching the read stream can't do any prefetching
(for index scans). It'd simply read the heap pages one by one, just
like now. Only with the batching part would it be able to prefetch.

I don't see either of those options as obviously superior, but maybe
there are good reasons to pick one? Opinions?

A related question is whether all index scans should use the read
stream API, or whether there should be a fallback to "regular" reads
through ReadBuffer. Right now the read stream is an optional field in
IndexFetchHeapData, initialized only for index AMs supporting batching,
with a couple of exceptions for cases where we don't expect batching
(and prefetch) to be very effective (e.g. for systables).

All built-in AMs do support batching, but my plan was to keep batching
optional, as I can't predict whether all index AMs can support it
(easily, or even at all).
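The fallback shape I have in mind looks roughly like this toy sketch in
plain C. The struct and function names are mine (stand-ins for
IndexFetchHeapData and the ReadBuffer path), not actual patch code; it
just shows the "optional stream, else direct read" dispatch:

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int BlockNumber;
typedef int Buffer;

/* Toy stand-ins for the two read paths (names are mine). */
static Buffer
direct_read(BlockNumber blkno)
{
	return (Buffer) blkno;		/* pretend ReadBuffer-style call */
}

static Buffer
stream_read(void *stream)
{
	(void) stream;
	return 42;					/* pretend read_stream_next_buffer-style call */
}

/* Toy version of a fetch struct with an optional stream field. */
typedef struct IndexFetchData
{
	void	   *rs;				/* read stream, or NULL if AM lacks batching */
} IndexFetchData;

/*
 * Fetch the next heap buffer: go through the read stream when the
 * index AM supports batching, otherwise fall back to plain block-by-
 * block reads.
 */
static Buffer
fetch_heap_buffer(IndexFetchData *scan, BlockNumber blkno)
{
	if (scan->rs != NULL)
		return stream_read(scan->rs);
	return direct_read(blkno);
}
```

The cost of keeping the fallback is an extra branch per fetch, but it
means AMs without batching keep working unchanged.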

I thought maybe we could simulate batching for those AMs by simply
treating individual items (returned by amgettuple) as tiny single-item
batches, and doing everything through the read stream.

But the annoying consequence is that we'd have to reset the stream after
every item, because there's no way to "pause" the stream once it runs
out of the current batch. I haven't measured how expensive that is,
maybe not much, but it seems a bit inconvenient.
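To illustrate why the per-item reset is annoying, here's a trivial
standalone C helper (mine, purely illustrative - not measured numbers
or patch code) counting the resets a scan would need: with the
single-item fallback the reset count equals the item count, while real
batches amortize it:

```c
#include <assert.h>

/*
 * Count how many stream resets a scan of nitems needs when the AM
 * produces batches of batch_size items: one reset per exhausted
 * batch.  With batch_size = 1 (the "single-item batch" fallback),
 * every single item forces a reset.
 */
static int
resets_needed(int nitems, int batch_size)
{
	return (nitems + batch_size - 1) / batch_size;
}
```

So a 100-item scan needs 100 resets with single-item batches, but only
2 with 50-item batches; whether 100 resets is actually expensive is the
unmeasured part.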

I wonder if there's a more natural / convenient way to handle this, when
we really can't look further ahead than at the very next item.

If this "single-item" batch idea is not usable, that means we can't
introduce read stream first. We have to introduce batching first, and
only then do the read stream change.

Opinions? I hope this wasn't too confusing :-(

regards

--
Tomas Vondra

Attachment Content-Type Size
v20241106-0001-WIP-index-batching-prefetching.patch text/x-patch 59.5 KB
v20241106-0002-WIP-batching-for-nbtree-indexes.patch text/x-patch 15.0 KB
v20241106-0003-PoC-support-for-mark-restore-for-nbtree.patch text/x-patch 13.1 KB
v20241106-0004-WIP-batching-for-hash-indexes.patch text/x-patch 14.3 KB
v20241106-0005-WIP-batching-for-gist-indexes.patch text/x-patch 11.7 KB
v20241106-0006-WIP-batching-for-sp-gist-indexes.patch text/x-patch 7.0 KB
v20241106-0007-WIP-stream-read-API.patch text/x-patch 47.7 KB
