Re: Adding skip scan (including MDAM style range skip scan) to nbtree

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-18 18:52:35
Message-ID: CAH2-WznKyHq_W7heu87z80EHyZepQeWbGuAZebcxZHvOXWCU-w@mail.gmail.com
Lists: pgsql-hackers

On Wed, Sep 18, 2024 at 7:36 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> Makes sense. I started with the testing before even looking at
> the code, so it's mostly a "black box" approach. I did read the 1995
> paper before that, and the script generates queries with clauses
> inspired by that paper, in particular:

I think that this black box testing approach is helpful, but it's also
something to refine over time. Gray box testing might work best.

> OK, understood. If it's essentially an independent issue (perhaps even
> counts as a bug?) what about correcting it on master first? Doesn't
> sound like something we'd backpatch, I guess.

What about backpatching it to 17?

As things stand, you can get quite contradictory counts of the number
of index scans due to irrelevant implementation details of parallel
index scans. It just looks wrong, particularly on 17, where it is
reasonable to expect near-exact consistency between parallel and
serial scans of the same index.
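
For instance (hypothetical table "t", but a real GUC and real EXPLAIN
options), one would expect these two runs to report essentially the
same number of index searches:

  SET max_parallel_workers_per_gather = 0;  -- force a serial scan
  EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM t WHERE b = 55;
  SET max_parallel_workers_per_gather = 4;  -- allow a parallel scan
  EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) FROM t WHERE b = 55;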

> Seems like a bit of a mess. IMHO we should either divide everything by
> nloops (so that everything is "per loop"), or not divide anything. My
> vote would be to divide, but that's mostly my "learned assumption" from
> the other fields. But having a 50:50 split is confusing for everyone.

My idea was that it made most sense to follow the example of
"Buffers:", since both describe physical costs.

Honestly, I'm more than ready to take whatever the path of least
resistance is. If dividing by nloops is what people want, I have no
objections.
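
For reference, "Buffers:" reports totals today, while "rows" and
"actual time" are per-loop averages. Any plan that repeats an inner
index scan shows the split (table names here are hypothetical):

  EXPLAIN (ANALYZE, BUFFERS)
  SELECT * FROM t1 JOIN t2 ON t2.t1_id = t1.id;
  -- inner index scan: rows/time are divided by loops, buffers summed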

> > It's worth having skip support (the idea comes from the MDAM paper),
> > but it's not essential. Whether or not an opclass has skip support
> > isn't accounted for by the cost model, but I doubt that it's worth
> > addressing (the cost model is already pessimistic).
> >
>
> I admit I'm a bit confused. I probably need to reread the paper, but my
> impression was that the increment/decrement is required for skipscan to
> work. If we can't do that, how would it generate the intermediate values
> to search for? I imagine it would be possible to "step through" the
> index, but I thought the point of skip scan is to not do that.

I think that you're probably still a bit confused because the
terminology in this area is a little confusing. There are two ways of
explaining the situation with types like text and numeric (types that
lack skip support). The two explanations might seem to be
contradictory, but they're really not, if you think about it.

The first way of explaining it focuses on how the scan moves through
the index:

For a text index column "a", and an int index column "b", skip scan
will work like this for a query with a qual "WHERE b = 55":
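
To make this concrete, assume a hypothetical setup (all names invented
for illustration):

  CREATE TABLE t (a text, b int);
  CREATE INDEX t_a_b_idx ON t (a, b);
  SELECT * FROM t WHERE b = 55;  -- no qual on the leading column "a"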

1. Find the first/lowest sorting "a" value in the index. Let's say
that it's "Aardvark".

2. Look for matches "WHERE a = 'Aardvark' and b = 55", possibly
returning some matches.

3. Find the next value after "Aardvark" in the index using a probe
like the one we'd use for a qual "WHERE a > 'Aardvark'". Let's say
that it turns out to be "Abacus".

4. Look for matches "WHERE a = 'Abacus' and b = 55"...

... (repeat these steps until we've exhaustively processed every
existing "a" value in the index)...

The second way of explaining it focuses on how the skip arrays
advance. Same query (and really the same behavior) as in the first
explanation:

1. The skip array's initial value is the sentinel -inf, which cannot
possibly match any real index tuple, but can still guide the search.
So we search for tuples "WHERE a = -inf AND b = 55" (in practice we
don't include the "b = 55" part, since it is unnecessary, but
conceptually it's part of what we search for within _bt_first).

2. Find that the index has no "a" values matching -inf (by definition
it can't have any), but locate the nearest value that sorts after it.
That closest matching value is "Aardvark". The skip array on "a"
therefore advances from -inf to "Aardvark".

3. Look for matches "WHERE a = 'Aardvark' and b = 55", possibly
returning some matches.

4. Reach tuples after the last match for "WHERE a = 'Aardvark' and b =
55", which causes us to advance the array on "a" incrementally inside
_bt_advance_array_keys (just as it would if there were a standard SAOP
array on "a" instead). The skip array on "a" therefore advances from
"Aardvark" to "Aardvark" +infinitesimal (we need to use sentinel
values for this text column, which lacks skip support).

5. Look for matches "WHERE a = 'Aardvark'+infinitesimal and b = 55",
which cannot possibly find any, but, again, can reposition the scan as
needed. We can't find an exact match, of course, but we do locate the
next closest match -- which is "Abacus", again. So the skip array now
advances from "Aardvark" +infinitesimal to "Abacus". The sentinel
values are made-up values, but that doesn't change anything. (And,
again, we don't include the "b = 55" part here, for the same reason as
before.)

6. Look for matches "WHERE a = 'Abacus' and b = 55"...

...(repeat these steps as many times as required)...
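
Expressed as ordinary searches (a conceptual sketch only, not executor
code), the repositioning probes from steps 2 and 5 are equivalent to:

  -- probe that advances the array from -inf to 'Aardvark':
  SELECT a FROM t ORDER BY a LIMIT 1;
  -- probe for 'Aardvark' +infinitesimal; behaves like "a > 'Aardvark'":
  SELECT a FROM t WHERE a > 'Aardvark' ORDER BY a LIMIT 1;  -- 'Abacus'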

In summary:

Even index columns that lack skip support get to "increment" (or
"decrement") their arrays by using sentinel values that represent -inf
(or +inf for backwards scans), as well as sentinels that represent
concepts such as "Aardvark" +infinitesimal (or "Zebra" -infinitesimal
for backwards scans, say). This scheme sounds contradictory, because
in one sense it allows every skip array to be incremented, but in
another sense it works around our not having a type-specific way to
increment values for many individual types/opclasses.

Inventing these sentinel values allows _bt_advance_array_keys to
reason about arrays without really having to care about which kinds of
arrays are involved, their order relative to each other, etc. In a
certain sense, we don't really need explicit "next key" probes of the
kind that the MDAM paper contemplates, though we still require the
same index accesses as a design with explicit probes would.

Does that make sense?

Obviously, if we did add skip support for text, it would be very
unlikely to help performance. Sure, one can imagine incrementing from
"Aardvark" to "Aardvarl" using dedicated opclass infrastructure, but
that isn't very helpful. You're almost certain to end up accessing the
same pages with such a scheme, anyway. What are the chances of an
index with a leading text column actually containing tuples matching
(say) "WHERE a = 'Aardvarl' and b = 55"? The chances are practically
zero. Whereas if the column "a" happens to use a discrete type such as
integer or date, then skip support is likely to help: there's a decent
chance that a value generated by incrementing the last value
(and I mean incrementing it for real) will find a real match when
combined with the user-supplied "b" predicate.

It might be possible to add skip support for text, but there wouldn't
be much point.
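
To make the contrast concrete, here is a hypothetical discrete-type
case (all names invented) where incrementing for real can plausibly
land on live rows:

  CREATE TABLE orders (order_date date, customer_id int);
  CREATE INDEX ON orders (order_date, customer_id);
  -- skip support can increment order_date directly, e.g.
  -- '2024-09-17' -> '2024-09-18', so each generated value has a
  -- decent chance of matching real rows:
  SELECT * FROM orders WHERE customer_id = 55;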

> Anyway, probably a good idea for extending the stress testing script.
> Right now it tests with "bigint" columns only.

Good idea.
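
For example, a text variant of the same kind of test table might look
like this (hypothetical; the script's real schema will differ):

  CREATE TABLE t_text AS
    SELECT md5(i::text) AS a, i % 1000 AS b
    FROM generate_series(1, 1000000) i;
  CREATE INDEX ON t_text (a, b);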

> Hmmm, yeah. I think it'd be useful to explain this reasoning (assuming
> no correlation means pessimistic skipscan costing) in a comment before
> btcostestimate, or somewhere close.

Will do.

> Don't we do something similar elsewhere? For example, IIRC we do some
> adjustments when estimating grouping in estimate_num_groups(), and
> incremental sort had to deal with something similar too. Maybe we could
> learn something from those places ... (both from the good and bad
> experiences).

I'll make a note of that. Gonna focus on regressions for now.

> Right. I don't think I've been suggesting having a separate path, I 100%
> agree it's better to have this as an option for index scan paths.

Cool.

> Sure. With this kind of testing I don't know what I'm looking for, so I
> try to cover a very wide range of cases. Inevitably, some of the cases
> will not test the exact subject of the patch. I think it's fine.

I agree. Just wanted to make sure that we were on the same page.

> I think it'd help if I go through the results and try to prepare some
> reproducers, to make it easier for you. After all, it's my script and
> you'd have to reverse engineer some of it.

Yes, that would be helpful.

I'll probably memorialize the problem by writing my own minimal test
case for it. I'm using the same TDD approach for this project as was
used for the related Postgres 17 project.

> This is on exactly the same data, after dropping caches and restarting
> the instance. So there should be no caching effects. Yet, there's a
> pretty clear difference - the total number of buffers is the same, but
> the patched version has many more hits. Yet it's slower. Weird, right?

Yes, it's weird. It seems likely that you've found an unambiguous bug,
not just a "regular" performance regression. The regressions that I
already know about aren't nearly this bad. So it seems like you have
the right general idea about what to expect, and it seems like your
approach to testing the patch is effective.

--
Peter Geoghegan
