Re: index prefetching

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2024-08-31 20:37:31
Message-ID: 6761860c-c66e-400a-b45f-aa741f548771@vondra.me
Lists: pgsql-hackers

Hi,

Here's an updated (and pretty fundamentally reworked) patch to add
prefetching to regular index scans. I'm far happier with this approach
than with either of the two earlier ones, and I actually think it might
even be easier to combine this with the streaming read API (which the
patch does not use at all for now). I'm feeling cautiously optimistic.

The patch is still WIP, but everything should be working fine (including
optimizations like kill_prior_tuple etc.). The patch actually passes
"make check-world" (even with valgrind) and I'm not aware of any bugs.
There are a couple limitations and things that need cleanup, ofc. Those
are mentioned at the end of this message.

The index prefetching had two prior patch versions, with very different
approaches, each having different drawbacks. The first one (posted
shortly before pgcon 2023) did the prefetching at a very low level, in
each index AM. We'd call amgettuple() -> btgettuple(), and that issued
prefetches for "future" TIDs from the same leaf page (at the correct
prefetch distance, etc.).

That mostly worked ... sort of. Every index AM had to reimplement the
logic, but the main problem was that it had no idea what happened above
the index AM. So it regressed cases that are unlikely to benefit from
prefetches - like IOS, where we don't need the heap page at all if it's
all-visible. And if we disabled prefetching for IOS, it could easily
lead to cases where a regular index scan is much faster than IOS (which
would seem quite bizarre to users).

We'd either need to teach the index AM about visibility checks (which
it should not need to know about), or inject the information in some
way, but then also cache the visibility check results (because checking
the visibility map is not free, and doing it repeatedly can regress the
"cached" case of IOS).

Perhaps that was solvable, but it felt uglier and uglier, and in the end
my conclusion was that it's not the right place to do the prefetches.
Why should an index AM initiate prefetches against the heap? It seems
the right place to do prefetches is somewhere higher, where we actually
have the information to decide if the heap page is needed. (I believe
this uncertainty made it harder to adopt the streaming read API too.)

This led to the second patch, which did pretty much everything in the
executor. The Index(Only)Scans simply called index_getnext_tid() in a
loop to fill a local "queue" driving the prefetching, and then also
consumed the TIDs from it again. The nice thing was this seemed to work
with any index AM as long as it had the amgettuple() callback.

Unfortunately, this complete separation of prefetching from index AM
turned out to be a problem. The ultimate issue that killed this was the
kill_prior_tuple, which we use to "remove" pointers to provably dead
heap tuples from the index early. With the single-tuple approach the
index AM processes the information before it unpins the leaf page, but
with a batch spanning multiple leaf pages, we can't rely on that - we
might have unpinned the page long before we get to process the list of
tuples to kill.

We have discussed different ways to deal with this - an obvious option
is to rework the index AMs to hold pins on all leaf pages needed by the
current batch. But despite the "obviousness" it's a pretty unattractive
option. It would require a lot of complexity and reworks in each index
AM to support this, which directly contradicts the primary benefit of
doing this in the executor - not having to do anything in the index AMs
and working for all index AMs.

Also, locking/pinning resources accessed asynchronously seems like a
great place for subtle bugs.

However, I had a bit of a lightbulb moment at pgconf.dev, when talking
to Andres about something only very remotely related - something to do
with accessing items in batches instead of one by one.

What if we didn't get the TIDs from the index one by one, but in larger
batches, and the index AM never gave us a batch spanning multiple leaf
pages? A sort of a "contract" for the API.

Yes, this requires extending the index AM. The existing amgettuple()
callback is not sufficient, because we don't know when the leaf page
changes (or is about to change), which makes it hard to communicate
information about tuples the AM has already returned.

There's a fairly long comment in indexam.c before the chunk of new code,
trying to explain how this is supposed to work. There's also a lot of
XXX comments scattered around, with open questions / ideas about various
parts of this.

But let me share a brief overview here ...

The patch adds a new callback amgettuplebatch() which loads an array of
items (into IndexScanDesc). It also adds index_batch_getnext() and
index_batch_getnext_tid() wrappers to access the batch.
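
To give a rough idea of the shape of the API - this is only a sketch,
the struct and field names here are made up and the actual patch may
structure this differently:

    /* hypothetical per-scan batch state (names made up) */
    typedef struct IndexScanBatch
    {
        int             firstItem;  /* first valid entry in items[] */
        int             lastItem;   /* last valid entry in items[] */
        int             itemIndex;  /* next entry to hand out */
        ItemPointerData items[MaxIndexTuplesPerPage];
    } IndexScanBatch;

    /* optional AM callback, mirroring amgettuple() - the "contract" is
     * that a returned batch never spans multiple leaf pages */
    typedef bool (*amgettuplebatch_function) (IndexScanDesc scan,
                                              ScanDirection direction);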

This means that if we have a loop reading tuples from an index scan

while (index_getnext_slot(scan, dir, slot))
{
    ... process the slot ...
}

we could replace it with something like

while (index_batch_getnext(scan, dir))
{
    while (index_batch_getnext_slot(scan, dir, slot))
    {
        ... process the slot ...
    }
}

Obviously, nodeIndexscan.c does that a bit differently, but I think the
idea is clear. For index-only scans it'd be more complicated, due to
visibility checks etc., but the overall idea is the same.

For kill_prior_tuple, the principle is about the same, except that we
collect information about which tuples to kill in the batch, and the AM
only gets that information before reading the next batch - at which
point it simply adds the tuples to its private list and kills them when
switching to the next leaf page.
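
In rough pseudo-C (the batch fields are made up, but _bt_killitems()
is the existing nbtree function that does the actual work):

    /* executor: remember batch items whose heap tuple is known dead */
    if (all_dead)
        batch->killedItems[batch->numKilled++] = batch->itemIndex;

    /*
     * index AM, on the next amgettuplebatch() call: the leaf page the
     * batch came from is still pinned, so apply the accumulated kill
     * list (nbtree can reuse _bt_killitems()) before stepping to the
     * next leaf page.
     */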

Obviously, this requires some new code in the index AM - I don't think
there's a way around that, the index AM has to have a say in this one
way or the other. Either it has to keep multiple leaf pages pinned, or
it needs to generate batches in a way that works with a single pin.

I've only done this for btree for now, but the amount of code needed is
pretty small - essentially I needed the btgettuplebatch, which is maybe
20 lines plus comments, and then _bt_first_batch/_bt_next_batch, which
are just simplified versions of _bt_first/_bt_next.

The _bt_first_batch/_bt_next_batch functions are a bit long, but
there's a lot of redundancy and it shouldn't be hard to cut them down
to ~1/2 with a bit of effort. I'm pretty sure other index AMs (e.g.
hash) could take a very similar approach to implement this.
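
For illustration, a simplified sketch of what btgettuplebatch boils
down to (the actual code in the patch may differ a bit):

    static bool
    btgettuplebatch(IndexScanDesc scan, ScanDirection dir)
    {
        BTScanOpaque so = (BTScanOpaque) scan->opaque;

        /* first call - descend the tree, load the initial batch */
        if (!BTScanPosIsValid(so->currPos))
            return _bt_first_batch(scan, dir);

        /* otherwise kill dead items and load the next batch */
        return _bt_next_batch(scan, dir);
    }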

A detail worth mentioning - the batches start small and gradually grow
over time, up to some maximum size (the patch hardcodes these limits as
8 and 64 at the moment). The reasons are similar to why we do this for
prefetching - not harming queries that only need a single row.
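
Something along these lines (the 8/64 limits are what the patch
hardcodes; the macro names and the doubling are just for illustration):

    #define INDEX_BATCH_SIZE_START    8   /* hypothetical names */
    #define INDEX_BATCH_SIZE_MAX      64

    /* after consuming a batch, allow the next one to be larger */
    batchSize = Min(batchSize * 2, INDEX_BATCH_SIZE_MAX);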

The changes to nodeIndexscan.c and nodeIndexonlyscan.c have a lot of
duplicate code too. That's partially intentional - I wanted to retain
the ability to test the "old" code easily, so I added a GUC to switch
between the two.

For plain index scans it might even be possible to "unify" the two
paths by tweaking index_getnext_slot to either get the TID from the
index or do the batch loop (with batching enabled). Not sure about IOS
- we don't want to repeat the visibility check in that case :-(

Actually, couldn't we have a per-batch cache of visibility checks? I
don't think we can get different answers to visibility checks for two
TIDs (for the same block) within the same batch, right? It'd simplify
the code I think, and perhaps it'd be useful even without prefetching.
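
Roughly something like this (the batch fields are made up, but
VM_ALL_VISIBLE is the existing macro):

    BlockNumber block = ItemPointerGetBlockNumber(tid);

    /* consult the visibility map once per heap block within a batch */
    if (batch->vmBlock != block)
    {
        batch->vmBlock = block;
        batch->vmAllVisible = VM_ALL_VISIBLE(scan->heapRelation, block,
                                             &node->ioss_VMBuffer);
    }

    if (batch->vmAllVisible)
    {
        /* all-visible - use the index tuple, skip the heap fetch */
    }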

I think the main priority is clarifying the boundary between indexam.c
and the AM code. Right now it's a bit messy, and it's not quite clear
which code is responsible for which fields. Sometimes a field is set by
indexam.c, but then one random place in nbtsearch.c sets it too, etc.

Finally, two things that might be an issue / I'm not quite sure about.

Firstly, do we need to support mixing batched and non-batched calls?
That is, given an index scan, should it be possible to interleave calls
to index_getnext_tid and index_batch_getnext/index_batch_getnext_tid?

I'm pretty sure that doesn't work, at least not right now, because with
batching the index AM does not have an exact idea "where" on the page
we actually are / which item is "current". I believe it might be
possible to improve this by "synchronizing" whenever we switch between
the two approaches. But I'm not sure it's something we need/want to
support - I can't quite imagine why I would need it.

The other thing is mark/restore. At the moment this does not work, for
pretty much the same reason - the index AM has no idea what the exact
"current" item on the page is, so mark/restore does unexpected things.
In the patch I "fixed" this by disabling batching/prefetching for plans
with EXEC_FLAG_MARK, so e.g. mergejoins won't benefit from this.

It did seem like an acceptable limitation to me, but now that I think
about it, if we could "synchronize" the position from the batch (if the
index AM requests it), I think this might work correctly.

I've yet to do a comprehensive benchmark, but the tests I've done
during development suggest the gains are in line with what we saw for
the earlier versions.

regards

--
Tomas Vondra

Attachment Content-Type Size
v20240831-0001-WIP-index-batching-prefetching.patch text/x-patch 77.3 KB
