Re: PoC: prefetching data between executor nodes (e.g. nestloop + indexscan)

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PoC: prefetching data between executor nodes (e.g. nestloop + indexscan)
Date: 2024-08-27 18:40:06
Message-ID: te22fesyqvvy5hg7yr2dpu46wywljnsasyyg56dc6h7q3sh5bb@5t5qruxytjyp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2024-08-26 18:06:04 +0200, Tomas Vondra wrote:
> I'm getting back to work on the index prefetching patch [1], but one
> annoying aspect of that patch is that it's limited to the context of a
> single executor node. It can be very effective when there's an index
> scan with many matches for a key, but it's entirely useless for plans
> with many tiny index scans.

Right.

> The patch does this:
> --------------------
>
> 1) ExecPrefetch executor callback
>
> This call is meant to do the actual prefetching - the parent node sets
> everything up almost as if for ExecProcNode(), but does not expect the
> actual result. The child either does some prefetching or nothing.
>
> 2) ExecSupportsPrefetch to identify what nodes accept ExecPrefetch()
>
> This simply says if a given node supports prefetching. The only place
> calling this is the nested loop, to enable prefetching only for nested
> loops with (parameterized) index scans.
>
> 3) ExecPrefetchIndexScan doing prefetching in index scans
>
> This is just trivial IndexNext() variant, getting TIDs and calling
> PrefetchBuffer() on them. Right now it just prefetches everything, but
> that's seems wrong - this is where the original index prefetching patch
> should kick in.
>
> 4) ExecNestLoop changes
>
> This is where the actual magic happens - if the inner child knows how to
> prefetch stuff (per ExecSupportsPrefetch), this switches to reading
> batches of outer slots, and calls ExecPrefetch() on them. Right now the
> batch size is hardcoded to 32, but it might use effective_io_concurrency
> or something like that. It's a bit naive in other aspects too - it
> always reads and prefetches the whole batch at once, instead of ramping
> up and then consuming and prefetching slots one by one. Good enough for
> PoC, but probably needs improvements.
>
> 5) adds enable_nestloop_prefetch to enable/disable this easily

Hm. Doing this via more executor tree traversals doesn't seem optimal, that's
not exactly free. And because the prefetching can be beneficial even if there
are nodes above the inner parametrized index node, we IMO would want to
iterate through multiple node levels.

Have you considered instead expanding the parameterized scan logic? Right now
nestloop passes down values one-by-one via PARAM_EXEC. What if we expanded
that to allow nodes, e.g. nestloop in this case, to pass down multiple values
in one parameter? That'd e.g. allow passing down multiple rows to fetch from
nodeNestloop.c to nodeIndexscan.c without needing to iterate over the executor
state tree. And it might be more powerful than just doing prefetching -
e.g. we could use one ScalarArrayOps scan in the index instead of doing a
separate scan for each of the to-be-prefetched values.

> benchmark
> ---------
>
> Of course, the main promise of this is faster queries, so I did a simple
> benchmark, with a query like this:
>
> SELECT * FROM fact_table f JOIN dimension d ON (f.id = d.id)
> WHERE f.r < 0.0001;
>
> The "r" is simply a random value, allowing to select arbitrary fraction
> of the large fact table "f". Here it selects 0.01%, so ~10k rows from
> 100M table. Dimensions have 10M rows. See the .sql file attached.
>
> For a variable number of dimensions (1 to 4) the results look like this:
>
> prefetch 1 2 3 4
> ----------------------------------------
> off 3260 6193 8993 11980
> on 2145 3897 5531 7352
> ----------------------------------------
> 66% 63% 62% 61%
>
> This is on "cold" data, with a restart + drop caches between runs. The
> results suggest the prefetching makes it about twice as fast. I was
> hoping for more, but not bad for a Poc, chances are it can be improved.

I think that's indeed a pretty nice win.

> One minor detail is that I ran into some issues with tuple slots. I need
> a bunch of them to stash the slots received from the outer plan, so I
> created a couple slots with TTSOpsVirtual. And that mostly works, except
> that later ExecInterpExpr() happens to call slot_getsomeattrs() and that
> fails because tts_virtual_getsomeattrs() says:
>
> elog(ERROR, "getsomeattrs is not required to be called on a virtual
> tuple table slot");
>
> OK, that call is not necessary for virtual slots, it's noop. But it's
> not me calling that, the interpreter does that for some reason. I did
> comment that error out in the patch, but I wonder what's the proper way
> to make this work ...

Hm, that's odd. slot_getsomeattrs() only calls tts_virtual_getsomeattrs() if
slot->tts_nvalid is smaller than the requested attnum. Which should never be
the case for a virtual slot. So I suspect there might be some bug leading to
wrong contents being stored in slots?

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2024-08-27 18:53:28 Re: PoC: prefetching data between executor nodes (e.g. nestloop + indexscan)
Previous Message Andres Freund 2024-08-27 18:27:00 Re: remove adaptive spins_per_delay code