From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Peter Geoghegan <pg(at)bowt(dot)ie>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2025-04-22 10:45:55
Message-ID: efac3238-6f34-41ea-a393-26cc0441b506@vondra.me
Lists: pgsql-hackers
Hi,
here's an improved (rebased + updated) version of the patch series, with
some significant fixes and changes. The patch adds infrastructure and
modifies btree indexes to do prefetching - and AFAIK it passes all tests
(no crashes, correct results). There's still a fair amount of work to be
done, of course - the btree changes are not very polished, more time
needs to be spent on profiling and optimization, etc. And I'm sure that
while the patch passes tests, there certainly are bugs.
Compared to the last patch version [1] shared on list (in November),
there are a number of significant design changes - much of this is based
on off-list discussions I had with Peter Geoghegan, which were very
helpful. Let me try to sum up the main conclusions and changes:
1) patch now relies on read_stream
The November patch still relied on sync I/O and PrefetchBuffer(). At
some point I added a commit switching it to read_stream - which turned
out non-trivial, especially for index-only scans. But it works, and for
a while I kept it separate - with PrefetchBuffer first, and a switch to
read_stream later. But then I realized it does not make much sense to
keep the first part - why would we introduce a custom fadvise-based
prefetch, only to immediately rip it out and replace it with with
read_stream code with a comparable amount of complexity, right?
So I squashed these two parts, and the patch now does read_stream (for
the table reads) from the beginning.
2) two new index AM callbacks - amgetbatch + amfreebatch
The [1] patch introduced a new callback for reading a "batch"
(essentially a leaf page) from the index. But there was a limitation of
only allowing a single batch at a time, which was causing trouble with
prefetch distance and read_stream stalls at the end of the batch, etc.
Based on the discussions with Peter I decided to make this a bit more
ambitious, moving the whole batch management from the index AM to the
indexam.c level. So now there are two callbacks - amgetbatch and
amfreebatch, and it's up to indexam.c to manage the batches - decide how
many batches to allow, etc. The index AM is responsible merely for
loading the next batch, but does not decide when to load or free a
batch, how many to keep in memory, etc.
There's a section in indexam.c with a more detailed description of the
design, I'm not going to explain all the design details here.
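To make the division of labor concrete, here's a self-contained toy model of the idea (all names - amgetbatch, amfreebatch, MAX_BATCHES - are stand-ins invented for this sketch, not the patch's actual signatures): the AM only loads and frees batches, while the indexam.c layer owns the policy of how many to keep in memory.

```c
#include <stdbool.h>

/* Hedged sketch, not the patch's API: the AM loads/frees batches,
 * the "indexam.c" layer decides how many stay in memory. */

#define MAX_BATCHES 4       /* indexam.c's policy, not the AM's */
#define ITEMS_PER_BATCH 5   /* stand-in for items on a leaf page */

typedef struct IndexBatch
{
    int first_item;         /* stand-in for a leaf page's item array */
    int nitems;
} IndexBatch;

static int live_batches = 0;    /* batches currently held in memory */

/* AM side: load the next batch ("leaf page"), or report end of scan */
static bool
amgetbatch(int *next_item, int total_items, IndexBatch *batch)
{
    if (*next_item >= total_items)
        return false;
    batch->first_item = *next_item;
    batch->nitems = total_items - *next_item;
    if (batch->nitems > ITEMS_PER_BATCH)
        batch->nitems = ITEMS_PER_BATCH;
    *next_item += batch->nitems;
    return true;
}

/* AM side: release a batch (this is where e.g. killtuples could happen) */
static void
amfreebatch(IndexBatch *batch)
{
    (void) batch;
    live_batches--;
}

/* indexam.c side: drive the scan, keeping at most MAX_BATCHES alive */
static int
scan_all(int total_items, int *max_live)
{
    IndexBatch batches[MAX_BATCHES];
    int head = 0, tail = 0;     /* ring of live batches */
    int next_item = 0, seen = 0;

    for (;;)
    {
        /* at the limit: free the oldest batch before loading another */
        if (tail - head == MAX_BATCHES)
            amfreebatch(&batches[head++ % MAX_BATCHES]);

        if (!amgetbatch(&next_item, total_items,
                        &batches[tail % MAX_BATCHES]))
            break;
        live_batches++;
        if (live_batches > *max_live)
            *max_live = live_batches;
        seen += batches[tail++ % MAX_BATCHES].nitems;
    }
    while (head < tail)
        amfreebatch(&batches[head++ % MAX_BATCHES]);
    return seen;
}
```

The point of the split is visible in scan_all(): the AM never decides when a batch is loaded or freed, it just services the two callbacks.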
In a way, this design is a compromise between the initial AM-level
approach I presented as a PoC at pgconf.dev 2023, and the executor level
approach I shared a couple months back. Each of those "extreme" cases
had its issues with being either "too deep" or "too high" - too
integrated in the AM, or not having enough information about the AM.
I think indexam.c is a sensible layer for this. I was hoping doing
this at the "executor level" would mean no need for AM code changes, but
that turned out not to be possible - the AM clearly needs to know about the
batch boundaries, so that it can e.g. do killtuples, etc. That's why we
need the two callbacks (not just the "amgetbatch" one). At least this
way it's "hidden" by the indexam.c API, like index_getnext_slot().
(You could argue indexam.c is "executor" and maybe it is - I don't know
where exactly to draw the line. I don't think it matters, really. The
"hidden in indexam API" is the important bit.)
3) btree prefetch
The patch implements the new callbacks only for btree indexes, and it's
not very pretty / clean - it's mostly a massaged version of the old code
backing amgettuple(). This needs cleanup/improvements, and maybe
refactoring to allow reusing more of the code, etc. Or maybe we should
even rip out amgettuple() entirely, and only support one of these APIs
for each AM? That's what Peter suggested, but I'm not convinced we
should do that.
For now it was very useful to be able to flip between the APIs by
setting a GUC, and I left prefetching disabled in some places (e.g. when
accessing catalogs, ...) that are unlikely to benefit. But more
importantly, I'm not 100% we want to require the index AMs to support
prefetching for all cases - if we do, a single "can't prefetch" case
would mean we can't prefetch anything for that AM.
In particular, I'm thinking about GiST / SP-GiST and indexes ordered by
distance, which don't return items in leaf-page order but sort them through a
binary heap. Maybe we can do prefetch for that, but if we can't it would
be silly if it meant we can't do prefetch for any other SP-GiST queries.
Anyway, the current patch only implements prefetch for btree. I expect
it won't be difficult to do this for other index AMs, considering how
similar the design usually is to btree.
This is one of the next things on my TODO. I want to be able to validate
the design works for multiple AMs, not just btree.
4) duplicate blocks
While working on the patch, I realized the old index_fetch_heap code
skips reads for duplicate blocks - if the TID matches the immediately
preceding block, ReleaseAndReadBuffer() skips most of the work. But
read_stream() doesn't do that - if the callback returns the same block,
it starts a new read for it, pins it, etc. That can be quite expensive,
and I've seen a couple cases where the impact was not negligible
(correlated index, fits in memory, ...).
I've speculated that maybe read_stream_next_buffer() should detect and
handle these cases better - not unlike it detects sequential reads. It
might even keep a small cache of already requested reads, etc. so that
it can handle a wider range of workloads, not just perfect duplicates.
But it does not do that, and I'm not sure if/when that will happen. So
for now I simply reproduced the "skip duplicate blocks" behavior. It's
not as simple with read_stream, because this logic needs to happen in
two places - in the callback (when generating reads), and then also when
reading the blocks from the stream - if these places get "out of sync"
the stream won't return the blocks expected by the reader.
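Here's a self-contained sketch of that two-place logic (the types and names are invented for this example, not the patch's code): the callback skips runs of duplicate blocks when generating reads, and the reader applies the exact same rule when deciding whether to pull the next block from the stream.

```c
#include <assert.h>

/* Illustrative sketch of "skip duplicate blocks": the stream callback and
 * the reader must apply the same dedup rule, or the reader will expect
 * blocks the stream never returns.  All names invented for this example. */

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)

typedef struct
{
    const BlockNumber *blocks;  /* heap block for each index TID, in order */
    int         ntids;
    int         next;           /* callback position */
    BlockNumber last_block;     /* last block handed to the stream */
} StreamState;

/* callback side: generate the next block to read, skipping runs of
 * duplicates (mirrors what ReleaseAndReadBuffer used to give us) */
static BlockNumber
next_block_cb(StreamState *s)
{
    while (s->next < s->ntids)
    {
        BlockNumber blk = s->blocks[s->next++];

        if (blk != s->last_block)
        {
            s->last_block = blk;
            return blk;
        }
    }
    return InvalidBlockNumber;  /* end of stream */
}

/* reader side: one "buffer" per TID, but only pull a new block from the
 * stream when the TID's block differs from the previous one; returns the
 * number of blocks actually read from the stream */
static int
read_all(const BlockNumber *blocks, int ntids)
{
    StreamState s = {blocks, ntids, 0, InvalidBlockNumber};
    BlockNumber cur = InvalidBlockNumber;
    int         nreads = 0;

    for (int i = 0; i < ntids; i++)
    {
        if (blocks[i] != cur)   /* the same rule as the callback */
        {
            cur = next_block_cb(&s);
            nreads++;
        }
        assert(cur == blocks[i]);   /* the two sides stayed in sync */
    }
    return nreads;
}
```

If either side dropped or changed its half of the rule, the assert would fire - which is exactly the "out of sync" failure mode described above.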
It does work, though, and it's not that complex. But there's an issue
with prefetch distance ...
5) prefetch distance
Traditionally, we measure distance in "tuples" - e.g. in bitmap heap
scan, we make sure we prefetched pages for X tuples ahead. But that's
not what read_stream does for prefetching - it works with pages. That
can cause various issues.
Consider for example the "skip duplicate blocks" optimization described
in (4). And imagine a perfectly correlated index, with ~200 items per
leaf page. The heap tuples are likely wider, let's say we have 50 of
them per page. That means each leaf page maps to only ~4 heap
blocks. With effective_io_concurrency=16 the read_stream will try to
prefetch 16 heap pages - that's 4 leaf pages, or ~800 index entries.
Is that what we want? I'm not quite sure, maybe it's OK? It sure is not
quite what I expected.
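The arithmetic in that example, as a sketch (the helper names are hypothetical; this assumes a perfectly correlated index and the duplicate-block skipping from (4), so each distinct heap page "costs" a full heap page's worth of index entries of lookahead):

```c
/* Hypothetical helpers illustrating the example's arithmetic;
 * not code from the patch. */

/* distinct heap blocks referenced by one leaf page */
static int
heap_blocks_per_leaf(int items_per_leaf, int tuples_per_heap_page)
{
    return items_per_leaf / tuples_per_heap_page;
}

/* index entries the stream must look ahead to schedule N heap pages */
static int
index_entries_consumed(int heap_pages_ahead, int tuples_per_heap_page)
{
    return heap_pages_ahead * tuples_per_heap_page;
}
```

With 200 items per leaf and 50 tuples per heap page, each leaf yields 4 heap blocks, so a 16-page prefetch distance means looking ~800 index entries ahead.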
But now imagine an index-only scan on a nearly all-visible table. If the
fraction of index entries that don't pass the visibility check is very
low, we can quickly get into a situation where the read_stream has to
read a lot of leaf pages to get the next block number.
Sure, we'd need to read that block number eventually, but doing it this
early means we may need to keep the batches (leaf pages) around - a lot
of them, actually. Essentially, pick a number and I can construct an IOS
that needs to keep more batches than that.
I think this is a consequence of read_stream having an internal idea how
far ahead to prefetch, based on the number of requests it got so far,
measured in heap blocks. It has no idea about the context (how that
maps to index entries, batches we need to keep in memory, ...).
Ideally, we'd be able to give this feedback to read_stream in some way,
say by "pausing" it when we get too far ahead in the index. But we don't
have that - the only thing we can do is to return InvalidBlockNumber to
the stream, so that it stops. And then we need to "reset" the stream,
and let it continue - but only after we consumed all scheduled reads.
In principle it's very similar to the "pause/resume" I mentioned, except
that it requires completely draining the queue - a pipeline stall.
That's not great, but hopefully it's not very common, and more
importantly - it only happens when only a tiny fraction of the index
items requires a heap block.
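As a toy model of this stop-drain-reset behavior (all names and limits are invented for the sketch; the "reset" stands in for restarting the stream after it was ended with InvalidBlockNumber):

```c
#include <stdbool.h>

/* Toy model: the stream cannot be paused, so when the callback gets "too
 * far ahead" (here: too many outstanding reads, standing in for too many
 * leaf-page batches held in memory) it ends the stream by returning
 * InvalidBlockNumber.  The reader must drain every scheduled read before
 * resetting and continuing - a pipeline stall. */

typedef unsigned int BlockNumber;
#define InvalidBlockNumber ((BlockNumber) 0xFFFFFFFF)
#define MAX_AHEAD 3             /* "too many batches" threshold */

typedef struct
{
    int     nblocks;    /* total heap blocks the scan needs */
    int     cb_pos;     /* blocks scheduled by the callback */
    int     read_pos;   /* blocks consumed by the reader */
    bool    ended;      /* callback returned InvalidBlockNumber */
} Stream;

/* callback: schedule the next read, or end the stream when too far ahead */
static BlockNumber
stream_callback(Stream *s)
{
    if (s->cb_pos >= s->nblocks ||              /* genuine end of scan */
        s->cb_pos - s->read_pos >= MAX_AHEAD)   /* too far ahead: stop */
    {
        s->ended = true;
        return InvalidBlockNumber;
    }
    return (BlockNumber) s->cb_pos++;
}

/* reader: consume everything, counting how often we had to stall */
static int
consume_all(Stream *s, int *nstalls)
{
    *nstalls = 0;
    while (s->read_pos < s->nblocks)
    {
        /* let the callback schedule reads until it ends the stream */
        while (!s->ended)
            (void) stream_callback(s);

        /* drain all scheduled reads */
        while (s->read_pos < s->cb_pos)
            s->read_pos++;

        /* only after draining can we reset the stream and continue */
        if (s->read_pos < s->nblocks)
        {
            s->ended = false;   /* stand-in for resetting the stream */
            (*nstalls)++;
        }
    }
    return s->read_pos;
}
```

Each stall fully empties the pipeline, which is why this is only acceptable if it happens rarely - i.e. when almost no index items need a heap block.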
So that's what the patch does. I think it's acceptable, but some
optimizations may be necessary (see next section).
6) performance and optimization
It's not difficult to construct cases where the prefetching is a huge
improvement - 5-10x speedup for a query is common, depending on the
hardware, dataset, etc.
But there are also cases where it doesn't (and can't) help very much.
For example fully-cached data, or index-only scans of all-visible
tables. I've done basic benchmarking based on that (I'll share some
results in the coming days), and in various cases I see a consistent
regression in the 10-20% range. The queries are very short (~1ms) and
there's a fair amount of noise, but it seems fairly consistent.
I haven't figured out the root cause(s) yet, but I believe there are a
couple of contributing factors:
(a) read_stream adds a bit of complexity/overhead - these cases worked
great with just the sync API, and can't benefit from the prefetching.
(b) There are inefficiencies in how I integrated read_stream into the
btree AM. For example, every batch allocates the same buffer that
btbeginscan used to allocate just once - this allocation turned out to
be an issue before [2], and now we do it for every batch, not once per
scan. That's not great.
(c) Possibly the prefetch distance issue from (5) might matter too.
[1]
https://www.postgresql.org/message-id/accd03eb-0379-416d-9936-41a4de3c47ef%40vondra.me
[2]
https://www.postgresql.org/message-id/510b887e-c0ce-4a0c-a17a-2c6abb8d9a5c@enterprisedb.com
regards
--
Tomas Vondra
Attachments:
v20250422-0001-WIP-index-prefetching.patch (text/x-patch, 92.9 KB)
v20250422-0002-WIP-batching-for-nbtree-indexes.patch (text/x-patch, 94.3 KB)
v20250422-0003-WIP-Don-t-read-the-same-block-repeatedly.patch (text/x-patch, 8.0 KB)