Quick Links

pgsql: Further optimize nbtree search scan key comparisons.

From:	Peter Geoghegan <pg(at)bowt(dot)ie>
To:	pgsql-committers(at)lists(dot)postgresql(dot)org
Subject:	pgsql: Further optimize nbtree search scan key comparisons.
Date:	2025-04-04 16:28:54
Message-ID:	E1u0juY-002ezY-2Q@gemulon.postgresql.org
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-committers

Further optimize nbtree search scan key comparisons.

Postgres 17 commit e0b1ee17 added two complementary optimizations to
nbtree: the "prechecked" and "firstmatch" optimizations. _bt_readpage
was made to avoid needlessly evaluating keys that are guaranteed to be
satisfied by applying page-level context. "prechecked" did this for
keys required in the current scan direction, while "firstmatch" did it
for keys required in the opposite-to-scan direction only.

The "prechecked" design had a number of notable issues. It didn't
account for the fact that an = array scan key's sk_argument field might
need to advance at the point of the page precheck (it didn't check the
precheck tuple against the key's array, only the key's sk_argument,
which needlessly made it ineffective in cases involving stepping to a
page having advanced the scan's arrays using a truncated high key).
"prechecked" was also completely ineffective when only one scan key
wasn't guaranteed to be satisfied by every tuple (it didn't recognize
that it was still safe to avoid evaluating other, earlier keys).

The "firstmatch" optimization had similar limitations. It could only be
applied after _bt_readpage found its first matching tuple, regardless of
why any earlier tuples failed to satisfy the scan's index quals. This
allowed unsatisfied non-required scan keys to impede the optimization.

Replace both optimizations with a new optimization, without any of these
limitations: the "startikey" optimization. Affected _bt_readpage calls
generate a page-level key offset ("startikey"), that their _bt_checkkeys
calls can then start at. This is an offset to the first key that isn't
known to be satisfied by every tuple on the page.

Although this is independently useful work, its main goal is to avoid
performance regressions with index scans that use skip arrays, but still
never manage to skip over irrelevant leaf pages. We must avoid wasting
CPU cycles on overly granular skip array maintenance in these cases.
The new "startikey" optimization helps with this by selectively
disabling array maintenance for the duration of a _bt_readpage call.
This has no lasting consequences for the scan's array keys (they'll
still reliably track the scan's progress through the index's key space
whenever the scan is "between pages").

Skip scan adds skip arrays during preprocessing using simple, static
rules, and decides how best to navigate/apply the scan's skip arrays
dynamically, at runtime. The "startikey" optimization enables this
approach. As a result of all this, the planner doesn't need to generate
distinct, competing index paths (one path for skip scan, another for an
equivalent traditional full index scan). The overall effect is to make
scan runtime close to optimal, even when the planner works off an
incorrect cardinality estimate. Scans will also perform well given a
skipped column with data skew: individual groups of pages with many
distinct values (in respect of a skipped column) can be read about as
efficiently as before -- without the scan being forced to give up on
skipping over other groups of pages that are provably irrelevant.

Many scans that cannot possibly skip will still benefit from the use of
skip arrays, since they'll allow the "startikey" optimization to be as
effective as possible (by allowing preprocessing to mark all the scan's
keys as required). A scan that uses a skip array on "a" for a qual
"WHERE a BETWEEN 0 AND 1_000_000 AND b = 42" is often much faster now,
even when every tuple read by the scan has its own distinct "a" value.
However, there are still some remaining regressions, affecting certain
trickier cases.

Scans whose index quals have several range skip arrays, each on some
high cardinality column, can still be slower than they were before the
introduction of skip scan -- even with the new "startikey" optimization.
There are also known regressions affecting very selective index scans
that use a skip array. The underlying issue with such selective scans
is that they never get as far as reading a second leaf page, and so will
never get a chance to consider applying the "startikey" optimization.
In principle, all regressions could be avoided by teaching preprocessing
to not add skip arrays whenever they aren't expected to help, but it
seems best to err on the side of robust performance.

Follow-up to commit 92fe23d9, which added nbtree skip scan.

Author: Peter Geoghegan <pg(at)bowt(dot)ie>
Reviewed-By: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi>
Reviewed-By: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Reviewed-By: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Discussion: https://postgr.es/m/CAH2-Wz=Y93jf5WjoOsN=xvqpMjRy-bxCE037bVFi-EasrpeUJA@mail.gmail.com
Discussion: https://postgr.es/m/CAH2-WznWDK45JfNPNvDxh6RQy-TaCwULaM5u5ALMXbjLBMcugQ@mail.gmail.com

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/8a510275dd6b343b8e1646e4754078eab8ef3ab5

Modified Files
--------------
src/backend/access/nbtree/nbtpreprocesskeys.c | 1 +
src/backend/access/nbtree/nbtree.c | 1 +
src/backend/access/nbtree/nbtsearch.c | 77 ++--
src/backend/access/nbtree/nbtutils.c | 546 +++++++++++++++++++++-----
src/include/access/nbtree.h | 11 +-
5 files changed, 483 insertions(+), 153 deletions(-)

Browse pgsql-committers by date

	From	Date	Subject
Next Message	Jeff Davis	2025-04-04 17:29:00	pgsql: Oversight in commit b81ffa13e3.
Previous Message	Alexander Korotkov	2025-04-04 16:03:21	Re: pgsql: Convert 'x IN (VALUES ...)' to 'x = ANY ...' then appropriate