From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2025-01-25 03:07:45
Message-ID: aa55adf3-6466-4324-92e6-5ef54e7c3918@iki.fi
Lists: pgsql-hackers
On 03/01/2025 21:43, Peter Geoghegan wrote:
> The newly revised "skipskip" optimization seems to get the regressions
> down to only a 5% - 10% increase in runtime across a wide variety of
> unsympathetic cases -- I'm now validating performance against a test
> suite based on the adversarial cases presented by Masahiro Ikeda on
> this thread. Although I think that I'll end up tuning the "skipskip"
> mechanism some more (I may have been too conservative in marginal
> cases that actually do benefit from skipping), I deem these
> regressions to be acceptable. They're only seen in the most
> unsympathetic cases, where an omitted leading column has groupings of
> no more than about 50 index tuples, making skipping pretty hopeless.
On my laptop, this is the worst case I could come up with:
create table skiptest as
  select g / 10 as a, g % 10 as b
  from generate_series(1, 10000000) g;
vacuum freeze skiptest;
create index on skiptest (a, b);
set enable_seqscan = off;
set max_parallel_workers_per_gather = 0;
\timing on
After repeating a few times, to warm the cache:
postgres=# select count(*) from skiptest where b=1;
count
---------
1000000
(1 row)
Time: 202.501 ms
And after 'set skipscan_prefix_cols=0':
select count(*) from skiptest where b=1;
count
---------
1000000
(1 row)
Time: 164.762 ms
EXPLAIN ANALYZE confirms that it uses an Index Only Scan in both cases.
> I knew from the outset that the hardest part of this project would be
> avoiding regressions in highly unsympathetic cases. The regressions
> that are still there seem very difficult to minimize any further; the
> overhead that remains comes from the simple need to maintain the
> scan's skip arrays once per page, before leaving the page. Once a scan
> decides to apply the "skipskip" optimization, it tends to stick with
> it for all future leaf pages -- leaving only the overhead of checking
> the high key while advancing the scan's arrays. I've cut just about
> all that that I can reasonably cut from the hot code paths that are at
> issue with the regressed cases.
Hmm, looking at the code and at a perf profile, a lot of code is
executed per tuple: function calls, argument passing, flag checks, and
so on. I suspect you could shave off some cycles by structuring the
code in a more compiler- and branch-prediction-friendly way, or just
with more aggressive inlining. I'm not sure exactly what to suggest,
but the code of _bt_readpage() and all the subroutines it calls is
complicated.
Aside from the performance of those routines, the structure doesn't
feel very intuitive: _bt_checkkeys() not only checks the current keys,
it can also read ahead to later tuples on the same page and update the
array keys.
One little thing I noticed by stepping through with a debugger is that
index_getattr() is called twice for the same tuple and attribute:
first in _bt_check_compare(), and if that sets *continuescan=false,
again in _bt_tuple_before_array_skeys(). The first index_getattr()
call is certainly pretty expensive, because that's where you take the
cache miss on the tuple when scanning. I'm not sure the second call
matters much, but it feels like a bad smell.
The comment on _bt_tuple_before_array_skeys() says:
> * readpagetup callers must only call here when _bt_check_compare already set
> * continuescan=false. We help these callers deal with _bt_check_compare's
> * inability to distinguishing between the < and > cases (it uses equality
> * operator scan keys, whereas we use 3-way ORDER procs)
That raises the question: why doesn't _bt_check_compare() call the
3-way ORDER proc in the first place? That would avoid the overhead of
the extra call here.
Aside from these micro-optimizations, I wonder about the "look-ahead"
logic in _bt_checkkeys_look_ahead. It feels pretty simplistic. Could you
use Exponential Search
(https://en.wikipedia.org/wiki/Exponential_search) instead of a plain
binary search on the page?
--
Heikki Linnakangas
Neon (https://neon.tech)