Re: Adding skip scan (including MDAM style range skip scan) to nbtree

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-11-06 19:07:46
Message-ID: CAH2-Wzny5oJ7RGcTO9vT9wyzXqx-N0qivtPXm+G0tuv68rdF9Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Nov 4, 2024 at 4:58 PM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> This is a review on v11, not the latest v13. I suspect most comments
> still apply, but I haven't verified this.

v11 is indeed quite similar to v13, so this shouldn't really matter.

> I'm a bit concerned about the additional operations that are being
> added to the scan. Before this patch, the amount of work in the
> "horizontal" portion of the scan was limited to user-supplied
> scankeys, so O(1) even when the index condition is only (f < 7). But,
> with this patch, we're adding work for (a=, b=, c=, etc.) for every
> tuple in the scan.

There's no question that there are still some cases where this cannot
possibly pay for itself. And that just isn't acceptable -- no
arguments here.

> As these new "skip array" keys are primarily useful for inter-page
> coordination (by determining if we want to start a primitive scan to
> skip to a different page and which value range that primitive scan
> would search for, or continue on to the next sibling), can't we only
> apply the "skip array" portion of the code at the final tuple we
> access on this page?

I plan on doing something like that. I'll need to.

AFAICT I only need to avoid wasting CPU cycles here -- there are no
notable regressions from performing excessive amounts of index
descents, as far as I can tell. And so I plan on doing this without
fundamentally changing anything about the current design.

In particular, I want the performance to remain robust in cases where
the best strategy varies significantly as the scan progresses -- even
when we need to strongly favor skipping at first, and then strongly
favor staying/not skipping later on (all during the same individual
index scan). I really like that the current design gets that part
right.

> While this section already defines some things about index scans which
> seem btree-specific, I don't think we should add more references to
> btree scan internals in a section about bitmaps and bitmap index
> scans.

This section of the docs discusses the trade-off between multiple
single column indexes, and fewer multi-column indexes. How could skip
scan not be relevant to such a discussion? One of the main benefits of
skip scan is that it'll allow users to get by with fewer indexes.

> > +++ b/src/backend/access/nbtree/nbtree.c
> [...]
> > - slock_t btps_mutex; /* protects above variables, btps_arrElems */
> > + LWLock btps_lock; /* protects above variables, btps_arrElems */
>
> Why is this changed to LWLock, when it's only ever acquired exclusively?

In general one should never do more than an extremely small, tightly
controlled amount of work with a spinlock held. It's now possible that
we'll allocate memory with the lock held -- doing that with a spinlock
held is an obvious no-no. We really need to use an LWLock for this
stuff now, on general principle.

> > +btestimateparallelscan(Relation rel, int nkeys, int norderbys)
>
> I notice you're using DatumSerialize. Are there reasons why we
> wouldn't want to use heap_fill_tuple, which generally produces much
> more compact output?

heap_fill_tuple doesn't support the notion of -inf and +inf scan key
sentinel values. Plus I'm inclined to use DatumSerialize because it's
more or less designed for this kind of problem.

> Also, I think you can limit the space usage to BLCKSZ in total,
> because a full index tuple can't be larger than 1/3rd of a block; and
> for skip scans we'll only have known equality bounds for a prefix of
> attributes available in the index tuples, and a single (?)
> index-produced dynamic attribute we want to skip ahead of. So, IIUC,
> at most we'll have 2 index tuples' worth of data, or 2/3 BLCKSZ.
> Right?

Possibly, but who wants to take a chance? The scheme you're describing
only saves memory when there's 3 skip arrays, which is fairly unlikely
in general.

I think that the approach taken to serializing the array keys should
be as conservative as possible. It's not particularly likely that
we'll want to do a parallel skip scan. It's rather hard to test those
code paths.

> I needed to look up what this 'cons up' thing is, as it wasn't
> something that I'd seen before. It also seems used exclusively in
> btree code, and only after the array keys patch, so I think it'd be
> better in general to use 'construct' here.

FWIW I wasn't the first person to use the term in the nbtree code.

I think you're right, though. It is a needlessly obscure term that is
only known to Lisp hackers. I'll fix it.

> > +++ b/src/backend/access/nbtree/nbtcompare.c
>
> The changes here are essentially 6x the same code, but for different
> types. What do you think about the attached
> 0001-Deduplicate[...].patch.txt, which has the same effect but with
> only 1 copy of the code checked in?

Reminds me of the approach taken by this extension:

https://github.com/petere/pguint

I do find the need to write so much boilerplate code for B-Tree
opclasses annoying. I also find it annoying that the nbtree code
insists on being as forgiving as possible with incomplete opfamilies.
But those problems seem out of scope here -- not like I'm really
making it any worse.

> > +++b/src/backend/access/nbtree/nbtutils.c
> [...]
> > +_bt_decide_skipatts(IndexScanDesc scan, BTSkipPreproc *skipatts)
>
> Why does this stop processing keys after hitting a row compare?

Why not? It's not ideal, but there are a number of things about
RowCompare scan keys that are already less than ideal. We don't even
try to do any kind of preprocessing that involves RowCompares --
they're already a slightly awkward special case to the nbtree code.

> Doesn't skip scan still apply to any subsequent normal keys? E.g.
> "c=1" creates a scan "a=skip, b=skip, c=1", so "(a, b)>(1, 2), c=1"
> should IMO still allow a skip scan for a=skip, b=1 to be constructed -
> it shouldn't be that we get a much less specific (and potentially,
> performant) scan just by adding a rowcompare scankey on early
> attributes.

It's just awkward to get it to work as expected, while still
preserving all of the useful properties of the design.

Your "(a, b)>(1,2)" scan key returns rows matching:

WHERE (a = 1 AND b > 2) OR (a > 1)

And so your complete "(a, b)>(1,2) AND c = 1" qual returns rows matching:

WHERE ((a = 1 AND b > 2) OR (a > 1)) AND c = 1

It is difficult to imagine how the existing design for skip arrays can
be extended to support this kind of qual, though. I guess we'd still
need skip arrays on both "a" and "b" here, though. Right?

The "b" skip array would be restricted to a range of values between 3
and +inf inclusive, if and only if we were still on the "a" skip
array's first array element (i.e. iff "a = 1" the "b" has a valid
low_compare). Otherwise (i.e. when "a > 1"), the "b" skip array
wouldn't be constrained by any low_compare inequality. So
low_compare/high_compare only apply conditionally, in a world where we
support these kinds of RowCompare quals. Right now I can avoid the
problem by refusing to allow "c = 1" to ever be marked required (by
never creating any skip array on an index attribute >= an attribute
with a RowCompare key on input).

Obviously, the current design of skip arrays involves arrays that are
always constrained by the same range/low_compare and high_compare
inequalities, independent of any other factor/any wider context. It's
not impossible to make something like your RowCompare case work, but
it'd be very significantly more complicated than the existing design.
Though not all that much more useful.

Doing something like this might make sense in the context of a project
that adds support for the MDAM paper's "General OR Optimization"
transformations -- RowCompare support would only be a bonus. I don't
see any opportunities to target RowCompare as independent work --
seems as if RowCompare quals aren't significantly simpler than what is
required to support very general MDAM OR optimizations.

> > _bt_preprocess_array_keys
> > - output_ikey = 0;
> > + numArrayKeyData,
> > + numSkipArrayKeys;
>
> I don't think numArrayKeyData/arrayKeyData are good names here, as it
> confused me many times reviewing this function's changes.

The code in this area actually did change recently, though I didn't announce it.

> On a scankey
> of a=1,b=2 we won't have any array keys, yet this variable is set to
> 2. Something like numOutputKeys is probably more accurate.

That's not accurate. _bt_preprocess_array_keys() sets
*new_numberOfKeys on output. When there are no array keys (of either
type) to be output, _bt_preprocess_array_keys will just return NULL --
it won't have changed the *new_numberOfKeys passed by its
_bt_preprocess_keys caller when it returns NULL.

In other words, when _bt_preprocess_array_keys determines that there
are no array keys to process (neither SAOP arrays nor skip arrays), it
won't modify anything, and won't return an alternative
input-to-_bt_preprocess_keys scan key array. _bt_preprocess_keys will
work directly off of the scan->keyData[] input scan keys passed by the
executor proper.

> > + /* Create a skip array and scan key where indicated by skipatts */
> > + while (numSkipArrayKeys &&
> > + attno_skip <= scan->keyData[input_ikey].sk_attno)
> > + {
> > + Oid opcintype = rel->rd_opcintype[attno_skip - 1];
> > + Oid collation = rel->rd_indcollation[attno_skip - 1];
> > + Oid eq_op = skipatts[attno_skip - 1].eq_op;
> > + RegProcedure cmp_proc;
> > +
> > + if (!OidIsValid(eq_op))
> > + {
> > + /* won't skip using this attribute */
> > + attno_skip++;
>
> Isn't this branch impossible, given that numSkipArrayKeys is output
> from _bt_decide_skipatts, whose output won't contain skipped
> attributes which have eq_op=InvalidOid? I'd replace this with
> Assert(OidIsValid(eq_op)).

It's very possible to hit this code path -- we'll hit this code path
every time an explicit "=" input scan key appears before an attribute
that requires a skip array (which happens whenever there's a third,
later column that we'll be able to mark required thanks to the skip
array).

Note that BTSkipPreproc.eq_op is the "= op to be used to add array, if
any". And so when we see !OidIsValid(eq_op) in this loop, it means
"this is for an index attribute that we don't want to add a skip array
to" (though, as I said, a later attribute is expected to get a skip
array when this happens).

> > _bt_rewind_nonrequired_arrays
>
> What types of scan keys can still generate non-required array keys?

Right now it's:

1. As you said, certain cases involving RowCompare scan keys.
2. Cases involving B-Tree operator classes that don't even have a "="
operator (yes, technically those are supported!).

Also seems possible that I'll end up relying on our support for
non-required arrays when I go fix those regressions. Seems possible
that I'll get rid of the explicit SK_BT_REQFWD/SK_BT_REQBKWD markings,
and go back to treating scan keys as required based on context (a
little like how things were prior to commit 7ccaf13a06).

> It seems to me those are now mostly impossible, as this patch generates
> required skip arrays for all attributes that don't yet have an
> equality key and which are ahead of any (in)equality keys, except the
> case with row compare keys which I already commented on above.

I agree that non-required arrays become an obscure edge case with the
patch, having been reasonably common before now (well, they were
common in Postgres 17). I don't think that that provides us with any
opportunities to get rid of unneeded code.

> > utils/skipsupport.[ch]
> I'm not sure why this is included in utils - isn't this exclusively
> used in access/nbtree/*?

The location of skip support is based on (though slightly different
to) the location of sort support. In general many B-Tree opclasses are
implemented in or around src/utils/adt/*. The exception is all of the
stuff in nbtcompare.c, though I always found that weird (I wouldn't
mind getting rid of nbtcompare.c by relocating its code places like
int.c and int8.c).

> > +++ b/src/include/access/nbtree.h
> BTArrayKeyInfo explodes in size, from 24B to 88B. I think some of that
> is necessary, but should it really be that large?

I'm disinclined to do anything about it right now.

I'll make a note of it, and review when the most important performance
problems are fixed.

Thanks for the review
--
Peter Geoghegan

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Vladlen Popolitov 2024-11-06 19:10:06 Re: Changing shared_buffers without restart
Previous Message Robert Haas 2024-11-06 19:06:58 magical eref alias names