Re: Adding skip scan (including MDAM style range skip scan) to nbtree

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>, Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-03-18 19:15:20
Message-ID: CAH2-Wz=G+amASrOh3FJ+T1N0B_2bhAKbB5aq-f3YUCXk-sVVLQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 17, 2025 at 6:51 PM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> This hasn't changed meaningfully in this patch, but I noticed that
> pstate.finaltup is never set for the final page of the scan direction
> (i.e. P_RIGHTMOST or P_LEFTMOST for forward or backward,
> respectively). If it isn't used more than once after the first element
> of non-P_RIGHTMOST/LEFTMOST pages, why is it in pstate? Or, if it is
> used more than once, why shouldn't it be used in

We don't set pstate.finaltup on either the leftmost or rightmost page
(nothing new, this is all from the Postgres 17 SAOP patch) because we
cannot possibly need to start another primitive index scan from that
point. We only use pstate.finaltup to determine if we should start a
new primitive index scan, every time the scan's array keys advance
(except during "sktrig_required=false" array advancement, and barring
the case where we don't have to check pstate.finaltup because we see
that the new set of array keys are already an exact match for the
tuple passed to _bt_advance_array_keys). In short, pstate.finaltup is
used during a significant fraction of all calls to
_bt_advance_array_keys, and there is no useful limit on the number of
times that pstate.finaltup can be used per _bt_readpage.

Although you didn't say it in your email (you said it to me on our
most recent call), I believe that you're asking about pstate.finaltup
for reasons that are more related to 0003-* than to 0001-*. As I
recall, you asked me about why it was that the 0003-* patch avoids
_bt_readpage's call to _bt_skip_ikeyprefix on the leftmost or
rightmost leaf page (i.e. a page that lacks a pstate.finaltup). You
were skeptical about that -- understandably so.

To recap, 0003-* avoids calling _bt_skip_ikeyprefix on the rightmost
(and leftmost) page because it reasons that activating that
optimization is only okay when we know that we'll be able to "recover"
afterwards, during the finaltup _bt_checkkeys call. By "recover", I
mean restore the invariant for required array keys: that they track
our progress through the key space. It felt safer and easier for
0003-* to just not call _bt_skip_ikeyprefix on a page that has no
finaltup -- that way we won't have anything that we need to recover
from, when we lack the pstate.finaltup that we generally expect to use
for this purpose. This approach allowed me to add the following
assertions a little bit later on, right at the end of _bt_readpage
(quoting 0003-* here):

@@ -1993,6 +2031,16 @@ _bt_readpage(IndexScanDesc scan, ScanDirection
dir, OffsetNumber offnum,
so->currPos.itemIndex = MaxTIDsPerBTreePage - 1;
}

+ /*
+ * As far as our caller is concerned, the scan's arrays always track its
+ * progress through the index's key space.
+ *
+ * If _bt_skip_ikeyprefix told us to temporarily treat all scan keys as
+ * nonrequired (during a skip scan), then we must recover afterwards by
+ * advancing our arrays using finaltup (with !pstate.forcenonrequired).
+ */
+ Assert(pstate.ikey == 0 && !pstate.forcenonrequired);
+
return (so->currPos.firstItem <= so->currPos.lastItem);
}

I've actually come around to your point of view on this (or what I
thought was your PoV from our call). That is, I now *think* that it
would be better if the code added by 0003-* called
_bt_skip_ikeyprefix, without regard for whether or not we'll have a
finaltup _bt_checkkeys call to "recover" (i.e. whether we're on the
leftmost or rightmost page shouldn't matter).

My change in perspective on this question is related to another change
of perspective, on the question of whether we actually need to call
_bt_start_array_keys as part of "recovering/restoring the array
invariant", just ahead of the finaltup _bt_checkkeys call. As you
know, 0003-* calls _bt_start_array_keys in this way, but that now
seems like overkill. It can have undesirable side-effects when the
array keys spuriously appear to advance, when in fact they were
restarted via the _bt_start_array_keys call, only to have their
original values restored via the finaltup call that immediately
follows.

Here's what I mean about side-effects:

We shouldn't allow _bt_advance_array_keys' primitive index scan
scheduling logic to falsely believe that the array keys have advanced,
when they didn't really advance in any practical sense. The commit
message of 0003-* promises that its optimization cannot influence
primitive index scan scheduling at all, which seems like a good
general principle for us to follow. But I recently discovered that
that promise was subtly broken, which I tied to the way that 0003-*
calls _bt_start_array_keys (this didn't produce wrong answers, and
didn't even really hurt performance, but it seems wonky and hard to
test). So I now think that I need to expect more from
_bt_skip_ikeyprefix and its pstate.forcenonrequired optimization, so
that my promise about primscan scheduling is honored.

Tying it back to your concern, once I do that (once I stop calling
_bt_start_array_keys in 0003-* to "hard reset" the arrays), I can also
stop caring about finaltup being set on the rightmost page, at the
point where we decide if _bt_skip_ikeyprefix should be called.

Here's how I think that this will be safe:

Obviously, we can't expect _bt_skip_ikeyprefix/pstate.forcenonrequired
mode to maintain the scan's required arrays in the usual way -- the
whole idea in 0003-* is to stop properly maintaining the arrays, until
right at the end of the _bt_readpage call, so as to save cycles.
However, it should be possible to teach _bt_skip_ikeyprefix to not
leave the array keys in an irredeemably bad state -- even when no
finaltup call is possible during the same _bt_readpage. And, even when
the scan direction changes at an inconvenient time. The next revision
(v29) is likely to strengthen the guarantees that _bt_skip_ikeyprefix
makes about the state that it'll leave the scan's array keys in, so
that its _bt_readpage caller can be laxer about "fully undoing" its
call to _bt_skip_ikeyprefix. The invariants we need to restore only
really apply when the scan needs to continue in the same scan
direction, at least one more page.

In short, as long as the array keys can never "go backwards" (relative
to the current scan direction), then we'll be able to recover during
the next conventional call to _bt_checkkeys (meaning the next
!pstate.forcenonrequired call) -- even if that next call should happen
on some other page (in practice it is very unlikely that it will, but
we can still be prepared for that). While it's true that
_bt_advance_array_keys (with sktrig_required=true) always promises to
advance the array keys to the maximum possible extent that it can know
to be safe, based on the caller's tuple alone, that in itself doesn't
obligate _bt_readpage to make sure that _bt_advance_array_keys will be
called when the top-level scan is over. We have never expected
_bt_advance_array_keys to *reliably* reach the final set of array keys
at the end of the scan (e.g., this won't happen when the index is
completely empty, since we'll never call _bt_readpage in the first
place).

> In forward scan mode, recovery from forcenonrequired happens after the
> main loop over all page items. In backward mode, it's in the loop:
>
> > + if (offnum == minoff && pstate.forcenonrequired)
> > + {
> > + Assert(so->skipScan);
>
> I think there's a comment missing that details _why_ we do this;
> probably something like:
>
> /*
> * We're about to process the final item on the page.
> * Un-set forcenonrequired, so the next _bt_checkkeys will
> * evaluate required scankeys and signal an end to this
> * primitive scan if we've reached a stopping point.
> */

I think that the right place to talk about this is above
_bt_skip_ikeyprefix itself.

> In line with that, could you explain a bit more about the
> pstate.forcenonrequired optimization? I _think_ it's got something to
> do with "required" scankeys adding some overhead per scankey, which
> can be significant with skipscan evaluations and ignoring the
> requiredness can thus save some cycles, but the exact method doesn't
> seem to be very well articulated.

The main benefit of the _bt_skip_ikeyprefix optimization is that it
allows us to skip a pstate.ikey prefix of scan keys in many important
cases. But that is not compatible with certain existing
_bt_advance_array_keys "sktrig_required=true" optimizations.

Most notably, we cannot assume that the array keys perfectly track our
progress through the index's key space when calling
_bt_advance_array_keys with "sktrig_required=false". In particular, it
would be wrong to allow the SAOP array binary search
cur_elem_trig=true optimization (see _bt_binsrch_array_skey) to be
used. We also don't want to attempt to end the primscan during
"sktrig_required=false" calls to _bt_advance_array_keys (nothing about
that is new here, it's just that _bt_skip_ikeyprefix now temporarily
forces the scan to behave this way, dynamically, rather than it being
static behavior that is fixed for the whole scan).

The underlying problem is that "sktrig_required=false" array
advancement cannot precisely reason about the relationship between the
scan's progress and the current required array key positions. For
example, with a query "WHERE a BETWEEN 0 AND 100 AND b in (42, 44)",
on a page whose "a" attribute values all satisfy the range qual on "a"
(i.e. with pstate.ikey = 1), our "a" skip array won't advance at all
(if we didn't use the _bt_skip_ikeyprefix optimization then we'd only
ever do "sktrig_required=false" advancement, and the skip array might
advance several times within a page, but we're ignoring "a" here). We
cannot reuse work across "b" SAOP binary searches, because in general
we're not paying attention to "a" at all -- and so we won't even try
to roll over "b" when the value of "a" advances (we're just looking at
"b", never "a").

> > _bt_skip_ikeyprefix
>
> I _think_ it's worth special-casing firstchangingattnum=1, as in that
> case we know in advance there is no (immediate) common ground between
> the index tuples and thus any additional work we do towards parsing
> the scankeys would be wasted - except for matching inequality bounds
> for firstchangingatt, or matching "open" skip arrays for a prefix of
> attributes starting at firstchangingattnum (as per the
> array->null_elem case).

Not sure what you mean by "special-casing firstchangingattnum=1"? What
"additional work we do towards parsing the scankeys" are you concerned
about?

It's fairly common for firstchangingattnum=1, even when the
_bt_skip_ikeyprefix optimization is working well. For example, that's
what'd happen with the example query I just gave (a query "WHERE a
BETWEEN 0 AND 100 AND b in (42, 44)" can skip "a" by setting
pstate.ikey=1, provided all of the "a" attribute values on the page
are within the range of the skip array).

> I also notice somed some other missed opportunities for optimizing
> page accesses:
>
> > + if (key->sk_strategy != BTEqualStrategyNumber)
>
> The code halts optimizing "prefix prechecks" when we notice a
> non-equality key. It seems to me that we can do the precheck on shared
> prefixes with non-equality keys just the same as with equality keys;
> and it'd improve performance in those cases, too.

Yeah, I was thinking of doing this already (though not for RowCompare
inequalities, which would be hard to evaluate from here). It makes
sense because it's exactly the same case as the range skip array case,
really -- why not just do it the same way?

> > + if (!(key->sk_flags & SK_SEARCHARRAY))
> > + if (key->sk_attno < firstchangingattnum)
> > + {
> > + if (result == 0)
> > + continue; /* safe, = key satisfied by every tuple */
> > + }
> > + break; /* pstate.ikey to be set to scalar key's ikey */
>
> This code finds out that no tuple on the page can possibly match the
> scankey (idxtup=scalar returns non-0 value) but doesn't (can't) use it
> to exit the scan. I think that's a missed opportunity for
> optimization; now we have to figure that out for every tuple in the
> scan. Same applies to the SAOP -array case (i.e. non-skiparray).

Maybe, but it's not much of a missed opportunity. It doesn't guarantee
that the scan can end in the case of a SAOP (the very next leaf page
could satisfy the same scan key, given a SAOP array with "gaps" in the
elements). So it can only really end the scan with a scalar = key --
though never when it is preceded by a skip array (doesn't matter if
the skip array is definitely satisfied/has only one distinct attribute
value on the page). Is this idea related to your previous idea
involving "special-casing firstchangingattnum=1"?

If I was going to do something like this, I think that it'd work by
backing out of applying the optimization entirely. Right now, 0003-*
applies the optimization whenever _bt_readpage decides to call
_bt_skip_ikeyprefix, regardless of the details after that (it would be
easy to teach _bt_skip_ikeyprefix to decide against applying its
optimization entirely, but testing seems to show that it always makes
sense to go ahead when _bt_skip_ikeyprefix is called, regardless of
what _bt_skip_ikeyprefix sees on the page).

> Thank you for working on this.

Thanks for the review!

--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2025-03-18 19:17:41 Re: race condition in pg_class
Previous Message Nathan Bossart 2025-03-18 19:08:42 Re: optimize file transfer in pg_upgrade