PoC: prefetching data between executor nodes (e.g. nestloop + indexscan)

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: PoC: prefetching data between executor nodes (e.g. nestloop + indexscan)
Date: 2024-08-26 16:06:04
Message-ID: c660f36c-f648-47f1-ad97-a362432451e4@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

I'm getting back to work on the index prefetching patch [1], but one
annoying aspect of that patch is that it's limited to the context of a
single executor node. It can be very effective when there's an index
scan with many matches for a key, but it's entirely useless for plans
with many tiny index scans.

For example, consider a plan like:

Nested Loop
-> ... some scan of a "fact" table ...
-> Index Scan on PK of a "dimension" table

For this the index prefetching is entirely useless - there'll be just
one match for each outer row.

But there still is opportunity for prefetching - we could look ahead in
the outer relation (which can be arbitrary node feeding the nestloop),
and request the index scan to prefetch the matching tuples.

Of course, this is not something the nestloop can do on it's own, it
would require support from the executor. For some nodes prefetching is
pretty straightforward (e.g. index scan), for other nodes it can be
impossible or at least very hard / too expensive.

I was a bit bored over the weekend, so I decided to experiment a bit and
see how difficult would this be, and how much could it gain. Attached is
an early PoC version of that patch - it's very limited (essentially just
NL + inner index scan), but it seems to be working. It's only about 250
insertions, to tiny.

The patch does this:
--------------------

1) ExecPrefetch executor callback

This call is meant to do the actual prefetching - the parent node sets
everything up almost as if for ExecProcNode(), but does not expect the
actual result. The child either does some prefetching or nothing.

2) ExecSupportsPrefetch to identify what nodes accept ExecPrefetch()

This simply says if a given node supports prefetching. The only place
calling this is the nested loop, to enable prefetching only for nested
loops with (parameterized) index scans.

3) ExecPrefetchIndexScan doing prefetching in index scans

This is just trivial IndexNext() variant, getting TIDs and calling
PrefetchBuffer() on them. Right now it just prefetches everything, but
that's seems wrong - this is where the original index prefetching patch
should kick in.

4) ExecNestLoop changes

This is where the actual magic happens - if the inner child knows how to
prefetch stuff (per ExecSupportsPrefetch), this switches to reading
batches of outer slots, and calls ExecPrefetch() on them. Right now the
batch size is hardcoded to 32, but it might use effective_io_concurrency
or something like that. It's a bit naive in other aspects too - it
always reads and prefetches the whole batch at once, instead of ramping
up and then consuming and prefetching slots one by one. Good enough for
PoC, but probably needs improvements.

5) adds enable_nestloop_prefetch to enable/disable this easily

benchmark
---------

Of course, the main promise of this is faster queries, so I did a simple
benchmark, with a query like this:

SELECT * FROM fact_table f JOIN dimension d ON (f.id = d.id)
WHERE f.r < 0.0001;

The "r" is simply a random value, allowing to select arbitrary fraction
of the large fact table "f". Here it selects 0.01%, so ~10k rows from
100M table. Dimensions have 10M rows. See the .sql file attached.

For a variable number of dimensions (1 to 4) the results look like this:

prefetch 1 2 3 4
----------------------------------------
off 3260 6193 8993 11980
on 2145 3897 5531 7352
----------------------------------------
66% 63% 62% 61%

This is on "cold" data, with a restart + drop caches between runs. The
results suggest the prefetching makes it about twice as fast. I was
hoping for more, but not bad for a Poc, chances are it can be improved.

I just noticed there's a couple failures in the regression tests, if I
change the GUC to "true" by default. I haven't looked into that yet, but
I guess there's some mistake in resetting the child node, or something
like that. Will investigate.

What I find a bit annoying is the amount of processing required to
happen twice - once for the prefetching, once for the actual execution.
In particular, this needs to set up all the PARAM_EXEC slots as if the
inner plan was to do regular execution.

The other thing is that while ExecPrefetchIndexScan() only prefetches
the heap page, it still needs to navigate the index to the leaf page. If
the index is huge, that may require I/O. But we'd have to pay that cost
shortly after anyway. It just isn't asynchronous.

One minor detail is that I ran into some issues with tuple slots. I need
a bunch of them to stash the slots received from the outer plan, so I
created a couple slots with TTSOpsVirtual. And that mostly works, except
that later ExecInterpExpr() happens to call slot_getsomeattrs() and that
fails because tts_virtual_getsomeattrs() says:

elog(ERROR, "getsomeattrs is not required to be called on a virtual
tuple table slot");

OK, that call is not necessary for virtual slots, it's noop. But it's
not me calling that, the interpreter does that for some reason. I did
comment that error out in the patch, but I wonder what's the proper way
to make this work ...

regards

[1]
https://www.postgresql.org/message-id/cf85f46f-b02f-05b2-5248-5000b894ebab%40enterprisedb.com

--
Tomas Vondra

Attachment Content-Type Size
0001-nestloop-prefetch-initial-PoC.patch text/x-patch 13.1 KB
nestloop.sql application/sql 942 bytes
nestloop.csv text/csv 4.4 KB
nestloop.sh application/x-shellscript 904 bytes

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2024-08-26 16:32:53 allowing extensions to control planner behavior
Previous Message mr.trubach 2024-08-26 16:03:46 [PATCH] Support systemd readiness notifications on reload