From: | Peter Geoghegan <pg(at)bowt(dot)ie> |
---|---|
To: | Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Subject: | Re: Bug in nbtree optimization to skip > operator comparisons (or < comparisons in backwards scans) |
Date: | 2023-12-06 18:54:45 |
Message-ID: | CAH2-Wzn+842AywNUcssXRijjHdHCVAvEWjKwicaQTfUDPy==bA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Wed, Dec 6, 2023 at 5:27 AM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> 1. When scanning an index in ascending order using scankey a > 1 (so,
> one that defines a start point of the scan), we don't need to check
> items for consistency with that scankey once we've found the first
> value that is consistent with the scankey, as all future values will
> also be consistent with the scankey (if we assume no concurrent page
> deletions).
BTW, I don't think that page deletion is a concern for these
optimizations in the way that it is for the similar idea of "dynamic
prefix compression", which works against insertion-type scan keys
(used to descend the tree and to do an initial binary search of a leaf
page).
We already rely on the first call to _bt_readpage (the one that takes
place from within _bt_first rather than from _bt_next) passing a page
offset number that's exactly at the start of where matches begin --
this is crucial in the case of scans with required equality strategy
scan keys (most scans). If we just skipped the _bt_binsrch and passed
P_FIRSTDATAKEY(opaque) to _bt_readpage within _bt_first instead, that
would break lots of queries. So the _bt_binsrch step within _bt_first
isn't just an optimization -- it's crucial. This is nothing new.
Recall that _bt_readpage only deals with search-type scan keys,
meaning scan keys that use a simple operator (so it uses = operators
with the equality strategy, as opposed to using a 3-way ORDER
proc/support function 1 that can tell the difference between < and >).
In general _bt_readpage doesn't know how to tell the difference
between a tuple that's before the start of = matches, and a tuple
that's at (or after) the end of any = matches. If it is ever allowed
to conflate these two cases, then we'll overlook matching tuples,
which is of course wrong (it'll terminate the scan before it even
starts). It is up to the caller (really just _bt_first) to never call
_bt_readpage in a way that allows this confusion to take place --
which is what makes the _bt_binsrch step crucial.
A race condition with page deletion might allow the key space covered
by a leaf page to "widen" after we've left its parent, but before we
arrive on the leaf page. But the _bt_binsrch step within _bt_first
happens *after* we land on and lock that leaf page, in any case. So
there is no risk of the scan ever doing anything with
concurrently-inserted index tuples. In general we only have to worry
about such race conditions when descending the tree -- they're not a
problem after the scan has reached the leaf level and established an
initial page offset number. (The way that _bt_readpage processes whole
pages in one atomic step helps with this sort of thing, too. We can
almost pretend that the B-Tree structure is immutable, even though
that's obviously not really true at all. We know that we'll always
make forward progress through the key space by remembering the next
page to visit when processing each page.)
My test case broke the required-in-opposite-direction-only
optimization by finding a way in which
required-in-opposite-direction-only inequalities were not quite the
same as required equalities with respect to this business about the
precise leaf page offset number that the scan begins at. They make
*almost* the same set of guarantees (note in particular that both will
be transformed into insertion scan key/3-way ORDER proc scan keys by
_bt_first's initial positioning code), but there is at least one
special case that applies only to inequalities. I had to play games
with weird incomplete opfamilies to actually break the optimization --
that was required to tickle the special case in just the right/wrong
way.
--
Peter Geoghegan
From | Date | Subject | |
---|---|---|---|
Next Message | Daniel Verite | 2023-12-06 18:59:11 | Re: Emitting JSON to file using COPY TO |
Previous Message | Robert Haas | 2023-12-06 18:52:59 | Re: generic plans and "initial" pruning |