Re: Adding skip scan (including MDAM style range skip scan) to nbtree

From: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-17 22:51:25
Message-ID: CAEze2WhiSkqifvZqrJQNgqktejJMb_-7pTWQUETN1w8a=h_-Yg@mail.gmail.com
Lists: pgsql-hackers

On Tue, 11 Mar 2025 at 16:53, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
>
> On Sat, Mar 8, 2025 at 11:43 AM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> > I plan on committing this one soon. It's obviously pretty pointless to
> > make the BTMaxItemSize operate off of a page header, and not requiring
> > it is more flexible.
>
> Committed. And committed a revised version of "Show index search count
> in EXPLAIN ANALYZE" that addresses the issues with non-parallel-aware
> index scan executor nodes that run from a parallel worker.
>
> Attached is v28. This is just to keep the patch series applying
> cleanly -- no real changes here.

You asked off-list for my review of 0003. I'd already reviewed 0001
before that, so that review is also included. I'll see if I can spend
some time on the other patches too, but for 0003 I think I got some
good consistent feedback.

0001:

> src/backend/access/nbtree/nbtsearch.c
> _bt_readpage

This hasn't changed meaningfully in this patch, but I noticed that
pstate.finaltup is never set for the final page of the scan direction
(i.e. P_RIGHTMOST or P_LEFTMOST for forward or backward,
respectively). If it isn't used more than once after the first element
of non-P_RIGHTMOST/LEFTMOST pages, why is it in pstate? Or, if it is
used more than once, why shouldn't it be used in

Apart from that, 0001 looks good to me.

0003:

> _bt_readpage

In forward scan mode, recovery from forcenonrequired happens after the
main loop over all page items. In backward mode, it's in the loop:

> + if (offnum == minoff && pstate.forcenonrequired)
> + {
> + Assert(so->skipScan);

I think there's a comment missing that details _why_ we do this;
probably something like:

/*
* We're about to process the final item on the page.
* Un-set forcenonrequired, so the next _bt_checkkeys will
* evaluate required scankeys and signal an end to this
* primitive scan if we've reached a stopping point.
*/

In line with that, could you explain a bit more about the
pstate.forcenonrequired optimization? I _think_ it's got something to
do with "required" scankeys adding some overhead per scankey, which
can be significant with skipscan evaluations and ignoring the
requiredness can thus save some cycles, but the exact method doesn't
seem to be very well articulated.

> _bt_skip_ikeyprefix

I _think_ it's worth special-casing firstchangingattnum=1, as in that
case we know in advance there is no (immediate) common ground between
the index tuples and thus any additional work we do towards parsing
the scankeys would be wasted - except for matching inequality bounds
for firstchangingatt, or matching "open" skip arrays for a prefix of
attributes starting at firstchangingattnum (as per the
array->null_elem case).

I also noticed some other missed opportunities for optimizing
page accesses:

> + if (key->sk_strategy != BTEqualStrategyNumber)

The code halts optimizing "prefix prechecks" when we notice a
non-equality key. It seems to me that we can do the precheck on shared
prefixes with non-equality keys just the same as with equality keys;
and it'd improve performance in those cases, too.
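
To illustrate the idea, here is a minimal standalone sketch, not the patch's code. It assumes integer attributes, keys sorted by attribute number, and a page whose tuples all share the same values for every attribute before firstchangingattnum. SketchKey, SketchOp, sketch_key_matches and sketch_skip_ikeyprefix are hypothetical names invented for this example. Because a shared-prefix attribute has one value across the whole page, an inequality evaluated once against the first tuple gives the answer for every tuple, just as an equality does:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-ins; these are not nbtree's structures. */
typedef enum { OP_LT, OP_LE, OP_EQ, OP_GE, OP_GT } SketchOp;

typedef struct SketchKey
{
    int      attno;   /* 1-based index attribute the key applies to */
    SketchOp op;      /* comparison strategy */
    int      arg;     /* scalar argument to compare against */
} SketchKey;

/* Evaluate "value op key->arg" for a single attribute value. */
static bool
sketch_key_matches(const SketchKey *key, int value)
{
    switch (key->op)
    {
        case OP_LT: return value < key->arg;
        case OP_LE: return value <= key->arg;
        case OP_EQ: return value == key->arg;
        case OP_GE: return value >= key->arg;
        case OP_GT: return value > key->arg;
    }
    return false;
}

/*
 * Return how many leading keys are satisfied by every tuple on the page,
 * so that per-tuple checks could start at that key offset.
 *
 * firsttup holds the attribute values of the page's first tuple.
 * Attributes before firstchangingattnum are assumed identical on all
 * tuples of the page, so one comparison decides the key for the page,
 * whether the key is an equality or an inequality.
 */
static int
sketch_skip_ikeyprefix(const SketchKey *keys, int nkeys,
                       const int *firsttup, int firstchangingattnum)
{
    int ikey;

    assert(firstchangingattnum >= 1);

    for (ikey = 0; ikey < nkeys; ikey++)
    {
        const SketchKey *key = &keys[ikey];

        if (key->attno >= firstchangingattnum)
            break;      /* attribute varies on this page */
        if (!sketch_key_matches(key, firsttup[key->attno - 1]))
            break;      /* key not satisfied; stop skipping here */
    }
    return ikey;
}
```

With keys (a = 5, b > 10, c < 7), a first tuple of (5, 20, 3) and firstchangingattnum = 3, the sketch skips both the equality on a and the inequality on b.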

> + if (!(key->sk_flags & SK_SEARCHARRAY))
> + if (key->sk_attno < firstchangingattnum)
> + {
> + if (result == 0)
> + continue; /* safe, = key satisfied by every tuple */
> + }
> + break; /* pstate.ikey to be set to scalar key's ikey */

This code finds out that no tuple on the page can possibly match the
scankey (idxtup=scalar returns non-0 value) but doesn't (can't) use it
to exit the scan. I think that's a missed opportunity for
optimization; now we have to figure that out for every tuple in the
scan. Same applies to the SAOP array case (i.e. non-skiparray).
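
A rough standalone sketch of the page-level check I have in mind, under the same simplifying assumptions as above (integer attributes, equality keys sorted by attribute number); SketchEqKey, SketchVerdict and sketch_page_precheck are hypothetical names, not anything in the patch. Whether the real scan could end outright or would have to start another primitive scan depends on key requiredness and scan direction, which this sketch does not model:

```c
#include <assert.h>

/* Hypothetical, simplified equality scan key; not nbtree's ScanKeyData. */
typedef struct SketchEqKey
{
    int attno;   /* 1-based index attribute the key applies to */
    int arg;     /* value the attribute must equal */
} SketchEqKey;

typedef enum
{
    PAGE_UNDECIDED,   /* per-tuple checks are still needed */
    PAGE_NO_MATCH     /* no tuple on this page can satisfy the keys */
} SketchVerdict;

/*
 * Decide once per page whether any tuple can match.
 *
 * Attributes before firstchangingattnum are assumed to have the same
 * value on every tuple of the page.  If an equality key on such an
 * attribute fails for the first tuple, it fails for all of them, so the
 * caller could stop processing the page instead of rediscovering the
 * mismatch for each tuple.
 */
static SketchVerdict
sketch_page_precheck(const SketchEqKey *keys, int nkeys,
                     const int *firsttup, int firstchangingattnum)
{
    assert(firstchangingattnum >= 1);

    for (int ikey = 0; ikey < nkeys; ikey++)
    {
        const SketchEqKey *key = &keys[ikey];

        if (key->attno >= firstchangingattnum)
            break;                  /* attribute varies on this page */
        if (firsttup[key->attno - 1] != key->arg)
            return PAGE_NO_MATCH;   /* mismatch holds for the whole page */
    }
    return PAGE_UNDECIDED;
}
```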

Thank you for working on this.

Kind regards,

Matthias van de Meent
Neon (https://neon.tech)
