Re: index prefetching

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Melanie Plageman <melanieplageman(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Georgios <gkokolatos(at)protonmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Konstantin Knizhnik <knizhnik(at)garret(dot)ru>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Subject: Re: index prefetching
Date: 2024-09-06 21:49:43
Message-ID: accd03eb-0379-416d-9936-41a4de3c47ef@vondra.me
Lists: pgsql-hackers

Hi,

here's an updated version of this patch series, with a couple major
improvements:

1) adding batching/prefetching to relevant built-in index AMs

This means btree, hash, gist and sp-gist, i.e. the index types that can
return tuples. For gin/brin batching is irrelevant (it would be more
correct to explicitly set amgetbatch to NULL, I guess).
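As a minimal sketch of what "amgetbatch set to NULL" could mean in
practice (all names here are hypothetical and do not reflect the actual
IndexAmRoutine layout in the patch):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, slimmed-down AM routine table: amgetbatch is an optional
 * callback, and AMs that cannot return tuples (gin, brin) would leave it
 * NULL. Names are illustrative, not the actual PostgreSQL API. */
typedef struct AmRoutine
{
    const char *amname;
    bool      (*amgetbatch) (void *scan);   /* NULL => no batching support */
} AmRoutine;

static bool
btree_getbatch(void *scan)
{
    (void) scan;
    return true;                /* pretend we loaded a batch */
}

/* The executor side would check this before using the batch interface. */
static bool
am_supports_batching(const AmRoutine *am)
{
    return am->amgetbatch != NULL;
}
```

The point is simply that callers can test the callback pointer rather
than special-casing particular AMs by name.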

Anyway, those patches are fairly small, maybe 10kB each, with 150-300
new lines. And the patches are pretty similar, thanks to the fact that
all the index AMs mirror btree (especially hash).

The main differences are in the ordered scans in gist/spgist, where the
approach is somewhat different, though not dramatically so. There's also
the business of returning orderbyvals/orderbynulls, and of index-only
scans, but that should work now too.

2) simplify / cleanup of the btree batching

There was a lot of duplication and copy-pasted code in the functions
that load the first/next batch; this version gets rid of that and
replaces the common code with a _bt_copy_batch() utility function. The
other index AMs have pretty much the same thing, adjusted for the scan
opaque struct specific to that index type.

I'm not saying it's perfect as it is, but it's way better, IMHO.

3) making mark/restore work for btree

This was one of the main limitations - the patch simply disabled
batching for plans requiring EXEC_FLAG_MARK, because of issues with
determining the correct position on the page in markpos(). I suggested
it should be possible to make this work by considering the batch index
in those calls, and restoring the proper batch in restrpos(), and this
updated patch does exactly that.

I haven't done any performance evaluation of whether batching helps in
these plans - if we restore to a position we already visited, we may not
need to prefetch those pages, and it might even make things slower.
Needs some more thought, I guess.
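To illustrate the mark/restore idea in miniature (hypothetical names,
not the actual btree code): markpos() has to remember the batch as well
as the offset within it, and restrpos() reloads both:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical position bookkeeping: a position is a (batch, offset)
 * pair rather than just an offset within the current leaf page. */
typedef struct BatchPos
{
    int     batch;              /* which batch */
    int     offset;             /* index within that batch */
} BatchPos;

typedef struct ScanState
{
    BatchPos cur;
    BatchPos mark;
    bool     mark_valid;
} ScanState;

/* markpos(): remember both the batch and the offset within it */
static void
scan_markpos(ScanState *s)
{
    s->mark = s->cur;
    s->mark_valid = true;
}

/* restrpos(): reload the marked batch, then jump to the saved offset */
static bool
scan_restrpos(ScanState *s)
{
    if (!s->mark_valid)
        return false;
    s->cur = s->mark;
    return true;
}
```

In the real code, restoring the batch may of course mean re-reading a
leaf page, which is where the prefetch-cost question above comes in.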

Also, I'm not quite happy with how the two layers interact. The index
AM should not know this much about the implementation details of
batching, so I plan to replace those accesses with a function in
indexam.c, or something like that.

It's still a bit rough, so I kept it in a separate patch.

This now passes "make check-world" with asserts, valgrind and all that.
I still need to put it through some stress testing and benchmarking to
see how it performs.

The layering still needs some more work. I've been quite unhappy with
how much the index AM needs to know about the "implementation details"
of the batching, and how unclear it was which layer manages which
fields. I think it's much better now - the goal is that:

* indexam.c updates the scandesc->xs_batch fields, and knows nothing
about the internal state of the index AM

* the AM can read scandesc->xs_batch data (perhaps by a function in
indexam.c), but never updates it

There are still a couple of places where this is violated (e.g.
btrestrpos(), which manipulates the batch index directly), but I believe
that's fairly easy to solve.
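A sketch of what such an indexam.c accessor might look like (the struct
fields and function name are made up for illustration, not taken from
the patch):

```c
#include <assert.h>

/* Hypothetical batch state owned by indexam.c. */
typedef struct IndexScanBatch
{
    int     nitems;
    int     itemIndex;          /* updated only by indexam.c */
    int     items[8];           /* stand-in for the real per-item data */
} IndexScanBatch;

/* A read-only accessor the AM could call (e.g. from btrestrpos())
 * instead of poking the batch fields directly. */
static int
index_batch_current_item(const IndexScanBatch *batch)
{
    assert(batch->itemIndex >= 0 && batch->itemIndex < batch->nitems);
    return batch->items[batch->itemIndex];
}
```

Taking a const pointer makes the one-way contract explicit: the AM can
read the batch state but never update it.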

Finally, I wrote that the basic contract making this possible is that a
batch should never span multiple leaf pages. I realized that's actually
not quite correct - it's perfectly fine for the AM to return batches
spanning multiple leaf pages, as long as the AM knows to keep all the
associated resources (pins, ...) until the next batch is requested.

It would also need to know how to handle kill_prior_tuples (which we now
accumulate per batch, and process before returning the next one), and
stuff like that.

It's just that with the restriction that a batch must not span multiple
leaf pages, it's fairly trivial to make this work. The changes required
by the current AM code are very limited, as demonstrated by the patches
adding this to gist/spgist/hash.

I can imagine the AMs being improved in this direction in the future.
We already have a place to keep track of this extra info - the scan
opaque struct. The AM could keep information about all the resources
needed by the last batch - in a way we already do that, except that
currently the batch needs exactly the same resources as a regular
non-batched scan.
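A toy sketch of that bookkeeping (hypothetical names; the real code
would hold actual buffer pins and also process the accumulated
kill_prior_tuples at this point):

```c
#include <assert.h>

#define MAX_BATCH_PINS 8

/* Hypothetical per-scan opaque state: the AM remembers every leaf page
 * it pinned for the current batch, and drops them all when the next
 * batch is requested. */
typedef struct ScanOpaque
{
    int     pinned_blocks[MAX_BATCH_PINS];
    int     npinned;
} ScanOpaque;

static void
remember_pin(ScanOpaque *so, int blkno)
{
    assert(so->npinned < MAX_BATCH_PINS);
    so->pinned_blocks[so->npinned++] = blkno;
}

/* Called when the next batch is requested; returns how many pins were
 * released. */
static int
release_batch_pins(ScanOpaque *so)
{
    int     n = so->npinned;

    so->npinned = 0;
    return n;
}
```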

Thinking about this a bit more, we'd probably want to allow multiple
in-flight batches. One shortcoming of the current single-batch approach
is that as we get close to the end of the batch, we can't issue
prefetches; only after we're done with that batch can we prefetch more
pages. Essentially, there's a "pipeline stall". I imagine we could allow
reading "future" batches so that we can issue prefetches for them, and
then eventually process them.

But that would also require some ability to inform the index AM which
batches are no longer needed, and can be de-allocated. Hmmm, perhaps it
would be possible to make this work with just two batches, as long as
they are sized for the proper prefetch distance.
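To illustrate the two-batch idea (purely hypothetical names and layout):
while consuming the current batch, we'd start loading/prefetching into
the other one as soon as the remaining items no longer cover the
prefetch distance:

```c
#include <assert.h>
#include <stdbool.h>

typedef struct Batch
{
    int     nitems;
} Batch;

/* Hypothetical two-batch ring: b[cur] is being consumed, the other
 * slot is available for the "future" batch. */
typedef struct BatchRing
{
    Batch   b[2];
    int     cur;                /* batch being consumed */
    int     pos;                /* next item to return from b[cur] */
    int     distance;           /* desired prefetch distance, in items */
} BatchRing;

/* true once prefetching would have to reach past the current batch,
 * i.e. the point where a single-batch scheme stalls */
static bool
need_next_batch(const BatchRing *r)
{
    return r->pos + r->distance >= r->b[r->cur].nitems;
}
```

With batches sized to at least the prefetch distance, two slots would be
enough to keep the prefetch pipeline full.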

In any case, that would be a future patch. I'm only mentioning this to
clarify that I believe the proposed approach does not really have the
"single leaf page" restriction (the AM can do whatever it wants). And
that it could even be extended to handle multiple batches.

regards

--
Tomas Vondra

Attachment Content-Type Size
v20240906-0001-WIP-index-batching-prefetching.patch text/x-patch 91.3 KB
v20240906-0002-PoC-support-for-mark-restore.patch text/x-patch 12.2 KB
v20240906-0003-WIP-batching-for-hash-indexes.patch text/x-patch 14.3 KB
v20240906-0004-WIP-batching-for-gist-indexes.patch text/x-patch 11.7 KB
v20240906-0005-WIP-batching-for-sp-gist-indexes.patch text/x-patch 7.0 KB
