Re: Adding skip scan (including MDAM style range skip scan) to nbtree

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Masahiro(dot)Ikeda(at)nttdata(dot)com
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-08-09 21:13:28
Message-ID: CAH2-Wz=9A_UtM7HzUThSkQ+BcrQsQZuNhWOvQWK06PRkEp=SKQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jul 24, 2024 at 5:14 PM Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> Attached is v4

Attached is v5, which splits the code from v4 patch into 2 pieces --
it becomes 0002-* and 0003-*. Certain refactoring work now appears
under its own separate patch/commit -- see 0002-* (nothing new here,
except the commit message/patch structure). The patch that actually
adds skip scan (0003-* in this new version) has been further polished,
though not in a way that I think is interesting enough to go into
here.

The interesting and notable change for v5 is the addition of the code
in 0001-*. The new 0001-* patch is concerned with certain aspects of
how _bt_advance_array_keys decides whether to start another primitive
index scan (or to stick with the ongoing one for one more leaf page
instead). This is a behavioral change, albeit a subtle one. It's also
kinda independent of skip scan (more on why that is at the end).

It's easiest to explain why 0001-* matters by way of an example. My
example will show significantly more internal/root page accesses than
seen on master, though only when 0002-* and 0003-* are applied, and
0001-* is omitted. When all 3 v5 patches are applied together, the
total number of index pages accessed by the test query will match the
master branch. It's important that skip scan never loses by much to
the master branch, of course. Even when the details of the index/scan
are inconvenient to the implementation, in whatever way.

Setup:

create table demo (int4 a, numeric b);
create index demo_idx on demo (a, b);
insert into demo select a, random() from generate_series(1, 10000) a,
generate_series(1,5) five_rows_per_a_val;
vacuum demo;

We now have a btree index "demo_idx", which has two levels (a root
page plus a leaf level). The root page contains several hundred pivot
tuples, all of which have their "b" value truncated away (or have the
value -inf, if you prefer), with just one prefix "a" column left in
place. Naturally, every leaf page has a high key with its own
separator key that matches one particular tuple that appears in the
root page (except for the rightmost leaf page). So our leaf level scan
will see lots of truncated leaf page high keys (all matching a
corresponding root page tuple).

Test query:

select a from demo where b > 0.99;

This is a query that really shouldn't be doing any skipping at all. We
nevertheless still see a huge amount of skipping with this query, ocne
0001-* is omitted. Prior to 0001-*, a new primitive index scan is
started whenever the scan reaches a "boundary" between adjoining leaf
pages. That is, whenever _bt_advance_array_keys stopped on a high key
pstate.finaltup. So without the new 0001-* work, the number of page
accesses almost doubles (because we access the root page once per leaf
page accessed, instead of just accessing it once for the whole scan).

What skip scan should have been doing all along (and will do now) is
to step forward to the next right sibling leaf page whenever it
reaches a boundary between leaf pages. This should happen again and
again, without our ever choosing to start a new primitive index scan
instead (it shouldn't happen even once with this query). In other
words, we ought to behave just like a full index scan would behave
with this query -- which is exactly what we get on master.

The scan will still nominally "use skip scan" even with this fix in
place, but in practice, for this particular query/index, the scan
won't ever actually decide to skip. So it at least "looks like" an
index scan from the point of view of EXPLAIN (ANALYZE, BUFFERS). There
is a separate question of how many CPU cycles we use to do all this,
but for now my focus is on total pages accessed by the patch versus on
master, especially for adversarial cases such as this.

It should be noted that the skip scan patch never had any problems
with this very similar query (same table as before):

select a from demo where b < 0.01;

The fact that we did the wrong thing for the first query, but the
right thing for this second similar query, was solely due to certain
accidental implementation details -- it had nothing to do with the
fundamentals of the problem. You might even say that 0001-* makes the
original "b > 0.99" case behave in the same manner as this similar "b
< 0.01" case, which is justifiable on consistency grounds. Why
wouldn't these two cases behave similarly? It's only logical.

The underlying problem arguably has little to do with skip scan;
whether we use a real SAOP array on "a" or a consed up skip array is
incidental to the problem that my example highlights. As always, the
underlying "array type" (skip vs SOAP) only matters to the lowest
level code. And so technically, this is an existing issue on
HEAD/master. You can see that for yourself by making the problematic
query's qual "where a = any ('every possible a value') and b > 0.99"
-- same problem on Postgres 17, without involving skip scan.

To be sure, the underlying problem does become more practically
relevant with the invention of skip arrays for skip scan, but 0001-*
can still be treated as independent work. It can be committed well
ahead of the other stuff IMV. The same is likely also true of the
refactoring now done in 0002-* -- it does refactoring that makes
sense, even without skip scan. And so I don't expect it to take all
that long for it to be committable.

--
Peter Geoghegan

Attachment Content-Type Size
v5-0001-Normalize-nbtree-truncated-high-key-array-behavio.patch application/octet-stream 13.1 KB
v5-0003-Add-skip-scan-to-nbtree.patch application/octet-stream 115.5 KB
v5-0002-Refactor-handling-of-nbtree-array-redundancies.patch application/octet-stream 18.6 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Heikki Linnakangas 2024-08-09 21:13:37 Re: Refactoring postmaster's code to cleanup after child exit
Previous Message Masahiko Sawada 2024-08-09 20:48:27 Re: Fix memory counter update in reorderbuffer