From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>
Subject: Re: index prefetching
Date: 2023-11-24 16:25:52
Message-ID: 367160ea-b1ed-4481-e804-bca509128878@enterprisedb.com
Lists: pgsql-hackers
Hi,
Here's a new WIP version of the patch set adding prefetching to indexes,
exploring a couple of alternative approaches. After the 2023/10/16
version of the patch, I happened to have an off-list discussion with
Andres, who suggested a couple of things to try, and there are a couple
more things I tried on my own too.
Attached is the patch series starting with the 2023/10/16 patch, and
then trying different things in separate patches (discussed later). As
usual, there's also a bunch of benchmark results - due to size I'm
unable to attach all of them here (the PDFs are pretty large), but you
can find them (with all the scripts etc.) at:
https://github.com/tvondra/index-prefetch-tests/tree/master/2023-11-23
I'll attach only a couple of small PNGs with highlighted
speedup/regression patterns, but they're barely readable and more of a
pointer to the PDFs.
A quick overview of the patches
-------------------------------
v20231124-0001-prefetch-2023-10-16.patch
- same as the October 16 patch, with only minor comment tweaks
v20231124-0002-rely-on-PrefetchBuffer-instead-of-custom-c.patch
- removes the custom cache of recently prefetched blocks, replacing it
simply by calling PrefetchBuffer (which checks shared buffers)
v20231124-0003-check-page-cache-using-preadv2.patch
- adds a check using preadv2(RWF_NOWAIT) to check if the whole
page is in page cache
v20231124-0004-reintroduce-the-LRU-cache-of-recent-blocks.patch
- adds back a small LRU cache to identify sequential patterns
(based on benchmarks of 0002/0003 patches)
v20231124-0005-hold-the-vm-buffer-for-IOS-prefetching.patch
v20231124-0006-poc-reuse-vm-information.patch
- optimize the visibilitymap handling when prefetching for IOS
(to deal with overhead in the all-visible cases) by holding the
VM buffer and reusing the visibility information
v20231124-0007-20231016-reworked.patch
- goes back to the 20231016 patch, but this time with the VM
optimizations from patches 0005/0006 (in retrospect I might have
simply moved 0005+0006 right after 0001, but the patch evolved
differently - shouldn't matter here)
Now, let's talk about the patches one by one ...
PrefetchBuffer + preadv2 (0002+0003)
------------------------------------
After I posted the patch in October, I happened to have an off-list
discussion with Andres, who suggested trying to ditch the local cache
of recently prefetched blocks and instead:
1) call PrefetchBuffer (which checks if the page is in shared buffers,
and skips the prefetch if it's already there)
2) if the page is not in shared buffers, use preadv2(RWF_NOWAIT) to
check if it's in the kernel page cache
Doing (1) is trivial - PrefetchBuffer() already does the shared buffer
check, so 0002 simply removes the custom cache code.
Doing (2) needs a bit more code to actually call preadv2() - 0003 adds
FileCached() to fd.c, smgrcached() to smgr.c, and then calls it from
PrefetchBuffer() right before smgrprefetch(). There are a couple of
loose ends (e.g. configure should check if preadv2 is supported), but
in principle I think this is generally correct.
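For illustration, here's a minimal standalone sketch of such a check
(not the actual fd.c/smgr.c code from 0003 - the function name and
structure here are made up):

#define _GNU_SOURCE
#include <stdbool.h>
#include <sys/types.h>
#include <sys/uio.h>

#define BLCKSZ 8192

/*
 * Is the 8kB block at "offset" in the kernel page cache? With
 * RWF_NOWAIT, preadv2() refuses to block on I/O - it fails with EAGAIN
 * (or returns a short read) if the data would have to come from disk,
 * so a full-length result means "fully cached".
 */
static bool
block_is_cached(int fd, off_t offset)
{
    char         buffer[BLCKSZ];
    struct iovec iov = {buffer, BLCKSZ};

    return (preadv2(fd, &iov, 1, offset, RWF_NOWAIT) == BLCKSZ);
}

PrefetchBuffer() can then skip issuing the prefetch (the fadvise) for
blocks where this returns true.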
Unfortunately, these changes led to a bunch of clear regressions :-(
Take a look at the attached point-4-regressions-small.png, which is page
5 from the full results PDF [1] (also available as a full-size PNG [2]).
As before, I plotted this as a huge
pivot table with various parameters (test, dataset, prefetch, ...) on
the left, and (build, nmatches) on the top. So each column shows timings
for a particular patch and query returning nmatches rows.
After the pivot table (on the right) is a heatmap, comparing timings for
each build to master (the first couple of columns). As usual, the
numbers are "timing compared to master" so e.g. 50% means the query
completed in 1/2 the time compared to master. Color coding is simple
too, green means "good" (speedup), red means "bad" (regression). The
higher the saturation, the bigger the difference.
I find this visualization handy as it quickly highlights differences
between the various patches. Just look for changes in red/green areas.
In the point-4-regressions-small.png image, you can see three areas of
clear regressions, either compared to master or to the 20231016 patch.
All of this is for "uncached" runs, i.e. after the instance got
restarted and the page cache was dropped too.
The first regression is for bitmapscan. The first two builds show no
difference compared to master - which makes sense, because the 20231016
patch does not touch any code used by bitmapscan, and the 0002 patch
simply uses PrefetchBuffer as is. But then 0003 adds preadv2 to it, and
the performance immediately sinks, with timings being ~5-6x higher for
queries matching 1k-100k rows.
The patches 0005/0006 can't possibly improve this, because the
visibilitymap is entirely unrelated to bitmapscans, and so is the small
LRU used to detect sequential patterns.
The indexscan regression #1 shows a similar pattern, but in the opposite
direction - indexscan cases that massively improved with the 20231016
patch (and even after just using PrefetchBuffer) revert back to master
with 0003 (adding preadv2). Ditching the preadv2 restores the gains (the
last build results are nicely green again).
The indexscan regression #2 is interesting too, and it illustrates the
importance of detecting sequential access patterns. It shows that as
soon as we call PrefetchBuffer() directly, the timings increase to maybe
2-5x compared to master. That's pretty terrible. Once the small LRU
cache used to detect sequential patterns is added back, the performance
recovers and the regression disappears. Clearly, this detection matters.
Unfortunately, the LRU can't do anything for the two other regressions,
because those are on random/cyclic patterns, so the LRU won't work
(certainly not for the random case).
preadv2 issues?
---------------
I'm not entirely sure if I'm using preadv2 somehow wrong, but it doesn't
seem to perform terribly well in this use case. I decided to do some
microbenchmarks, measuring how long it takes to do preadv2 when the
pages are [not] in cache etc. The C files are at [3].
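For reference, a simplified sketch of the NOWAIT timing loop these tests
are built around (the actual code in [3] also runs a pass without the
flag, checks per-call results more carefully, etc.):

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/time.h>
#include <sys/uio.h>
#include <unistd.h>

#define BLOCK_SIZE 8192

int
main(int argc, char **argv)
{
    int          fd = open(argv[1], O_RDONLY);
    off_t        size = lseek(fd, 0, SEEK_END);
    char         buffer[BLOCK_SIZE];
    struct iovec iov = {buffer, BLOCK_SIZE};
    long         hits = 0, misses = 0;
    struct timeval start, end;

    gettimeofday(&start, NULL);

    /* try to read each block, without ever blocking on actual I/O */
    for (off_t off = 0; off < size; off += BLOCK_SIZE)
    {
        if (preadv2(fd, &iov, 1, off, RWF_NOWAIT) == BLOCK_SIZE)
            hits++;
        else
            misses++;
    }

    gettimeofday(&end, NULL);

    printf("preadv2 NOWAIT time %ld us calls %ld hits %ld misses %ld\n",
           (long) (end.tv_sec - start.tv_sec) * 1000000L +
           (end.tv_usec - start.tv_usec),
           hits + misses, hits, misses);

    close(fd);
    return 0;
}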
preadv2-test simply reads the file twice, first with NOWAIT and then
without it. With a clean page cache, the results look like this:
file: ./tmp.img size: 1073741824 (131072) block 8192 check 8192
preadv2 NOWAIT time 78472 us calls 131072 hits 0 misses 131072
preadv2 WAIT time 9849082 us calls 131072 hits 131072 misses 0
and then, if you run it again with the file still being in page cache:
file: ./tmp.img size: 1073741824 (131072) block 8192 check 8192
preadv2 NOWAIT time 258880 us calls 131072 hits 131072 misses 0
preadv2 WAIT time 213196 us calls 131072 hits 131072 misses 0
This is pretty terrible, IMO. It says that if the pages are not in
cache, the preadv2 calls take ~80ms in total. That's very cheap compared
to the total read time (so if we can speed that up by prefetching, it's
worth it). But if the file is already in cache, the calls take ~260ms,
which actually exceeds the time needed to just do preadv2() without the
NOWAIT flag.
AFAICS the problem is that preadv2() doesn't just check if the data is
available - it also copies the data and all that. But even if we only
ask for the first byte, it's still way more expensive than with an
empty cache:
file: ./tmp.img size: 1073741824 (131072) block 8192 check 1
preadv2 NOWAIT time 119751 us calls 131072 hits 131072 misses 0
preadv2 WAIT time 208136 us calls 131072 hits 131072 misses 0
There's also a fadvise-test microbenchmark that just does fadvise all
the time, and even that is way cheaper than using preadv2(NOWAIT) in
both cases:
no cache:
file: ./tmp.img size: 1073741824 (131072) block 8192
fadvise time 631686 us calls 131072 hits 0 misses 0
preadv2 time 207483 us calls 131072 hits 131072 misses 0
cache:
file: ./tmp.img size: 1073741824 (131072) block 8192
fadvise time 79874 us calls 131072 hits 0 misses 0
preadv2 time 239141 us calls 131072 hits 131072 misses 0
So that's ~300ms vs. ~500ms in the cached case (the difference in the
no-cache case is even more significant).
It's entirely possible I'm doing something wrong, or maybe I'm just
thinking about this the wrong way, but I can't quite imagine this
approach being useful here - at least not for reasonably good local
storage. Maybe it could help for slow/remote storage, or something?
For now, I think the right approach is to go back to the cache of
recently prefetched blocks. What I liked about the preadv2 approach is
that it knows exactly what is currently in the page cache, while the
local cache is only an approximation tracking recently prefetched
blocks. The preadv2 approach also knows about stuff prefetched by other
backends, while the local cache is private to the particular backend
(or even to the particular scan node). But the local cache seems to
perform much better, so there's that.
LRU cache of recent blocks (0004)
---------------------------------
The importance of this optimization is clearly visible in the regression
image mentioned earlier - the "indexscan regression #2" shows that the
sequential pattern regresses with the 0002+0003 patches, but once the
small LRU cache is introduced back and used to skip prefetching for
sequential patterns, the regression disappears. Ofc, this is part of the
original 20231016 patch, so going back to that version naturally
includes this.
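Conceptually it's something like the following sketch (not the actual
0004 code - the names and the cache size are made up):

#include <stdbool.h>

#define PREFETCH_LRU_SIZE 8            /* made-up size */
#define InvalidBlockNumber 0xFFFFFFFF

typedef unsigned int BlockNumber;

typedef struct PrefetchLRU
{
    BlockNumber blocks[PREFETCH_LRU_SIZE];
    int         next;                  /* oldest slot, overwritten next */
} PrefetchLRU;

static void
prefetch_lru_init(PrefetchLRU *lru)
{
    for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
        lru->blocks[i] = InvalidBlockNumber;
    lru->next = 0;
}

/*
 * Should we prefetch "block"? Skip it if the block was requested
 * recently, or immediately follows a recently requested block - in that
 * case the pattern is (nearly) sequential, and OS readahead is likely
 * to do a better job than explicit prefetching.
 */
static bool
prefetch_needed(PrefetchLRU *lru, BlockNumber block)
{
    for (int i = 0; i < PREFETCH_LRU_SIZE; i++)
    {
        if (lru->blocks[i] == InvalidBlockNumber)
            continue;
        if (lru->blocks[i] == block || lru->blocks[i] + 1 == block)
            return false;
    }

    /* remember the block, overwriting the oldest entry */
    lru->blocks[lru->next] = block;
    lru->next = (lru->next + 1) % PREFETCH_LRU_SIZE;

    return true;
}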
visibility map optimizations (0005/0006)
----------------------------------------
Earlier benchmark results showed a somewhat annoying regression for
index-only scans that don't need prefetching (i.e. with all pages
all-visible). There was quite a bit of inefficiency because both the
prefetcher and the IOS code accessed the visibilitymap independently,
and the prefetcher did that in a rather inefficient way. These patches
make the prefetcher more efficient by reusing the VM buffer, and also
share the visibility info between the prefetcher and the IOS code.
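The buffer-reuse part boils down to the difference between these two
patterns (a simplified sketch, not the actual patch code):

#include "postgres.h"
#include "access/visibilitymap.h"
#include "storage/bufmgr.h"

/* naive: pin + release the VM buffer for every heap block */
static bool
all_visible_naive(Relation rel, BlockNumber blkno)
{
    Buffer      vmbuffer = InvalidBuffer;
    bool        all_visible = VM_ALL_VISIBLE(rel, blkno, &vmbuffer);

    if (BufferIsValid(vmbuffer))
        ReleaseBuffer(vmbuffer);

    return all_visible;
}

/*
 * reuse: the caller keeps "vmbuffer" pinned across calls - one VM page
 * covers many heap pages, so VM_ALL_VISIBLE mostly hits the pinned
 * buffer - and releases it once, at the end of the scan
 */
static bool
all_visible_reuse(Relation rel, BlockNumber blkno, Buffer *vmbuffer)
{
    return VM_ALL_VISIBLE(rel, blkno, vmbuffer);
}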
I'm sure this needs more work / cleanup, but the regression is mostly
gone, as illustrated by the attached point-0-ios-improvement-small.png.
layering questions
------------------
Aside from the preadv2() question, the main open question remains to be
the "layering", i.e. which code should be responsible for prefetching.
At the moment all the magic happens in indexam.c, in index_getnext_*
functions, so that all callers benefit from prefetching.
But as mentioned earlier in this thread, indexam.c seems to be the wrong
layer, and I think I agree. The problem is that the prefetching needs to
happen in index_getnext_* so that all index_getnext_* callers benefit
from it. We could do that in the executor for index_getnext_tid(), but
that's a bit weird - it'd work for index-only scans, but the primary
target is regular index scans, which call index_getnext_slot().
However, it seems it'd be good if the prefetcher and the executor code
could exchange/share information more easily. Take for example the
visibilitymap stuff for IOS in patches 0005/0006. I made it work, but it
sure looks inconvenient, partially due to the split between executor and
indexam code.
The only idea I have is to have the prefetcher code somewhere in the
executor, but then pass it to index_getnext_* functions, either as a new
parameter (with NULL => no prefetching), or maybe as a field of scandesc
(but that seems wrong, to point from the desc to something that's
essentially a part of the executor state).
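Just to illustrate the shape of that idea (hypothetical signatures, not
actual patch code):

/* hypothetical: prefetcher state created/owned by the executor node */
typedef struct IndexPrefetch IndexPrefetch;

/* prefetch == NULL means "no prefetching" */
extern ItemPointer index_getnext_tid(IndexScanDesc scan,
                                     ScanDirection direction,
                                     IndexPrefetch *prefetch);

extern bool index_getnext_slot(IndexScanDesc scan,
                               ScanDirection direction,
                               TupleTableSlot *slot,
                               IndexPrefetch *prefetch);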
There's also the thing that the prefetcher is part of IndexScanDesc, but
it really should be in the IndexScanState. That's weird, but mostly down
to my general laziness.
regards
[1]
https://github.com/tvondra/index-prefetch-tests/blob/master/2023-11-23/pdf/point.pdf
[2]
https://github.com/tvondra/index-prefetch-tests/blob/master/2023-11-23/png/point-4.png
[3]
https://github.com/tvondra/index-prefetch-tests/tree/master/2023-11-23/preadv-tests
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
  v20231124-0001-prefetch-2023-10-16.patch (text/x-patch, 53.5 KB)
  v20231124-0002-rely-on-PrefetchBuffer-instead-of-custom-c.patch (text/x-patch, 22.3 KB)
  v20231124-0003-check-page-cache-using-preadv2.patch (text/x-patch, 7.9 KB)
  v20231124-0004-reintroduce-the-LRU-cache-of-recent-blocks.patch (text/x-patch, 8.2 KB)
  v20231124-0005-hold-the-vm-buffer-for-IOS-prefetching.patch (text/x-patch, 2.4 KB)
  v20231124-0006-poc-reuse-vm-information.patch (text/x-patch, 8.8 KB)
  v20231124-0007-20231016-reworked.patch (text/x-patch, 18.7 KB)
  point-0-ios-improvement-small.png (image/png, 187.7 KB)
  point-4-regressions-small.png (image/png, 221.2 KB)