Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, benoit <benoit(at)hopsandfork(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Date: 2023-12-05 03:25:55
Message-ID: CAH2-Wz=dbKvQ-wB24VhHzF_gjyckZ531yYDJ6JNfU8aBC+2e6w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Nov 27, 2023 at 5:39 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> - +1 on the general idea. Hard to see any downsides if implemented right.

Glad you think so. The "no possible downside" perspective is one that
the planner sort of relies on, so this isn't just a nice-to-have -- it
might actually be a condition of committing the patch. It's important
that the planner can be very aggressive about using SAOP index quals,
without us suffering any real downside at execution time.

> - This changes the meaning of amsearcharray==true to mean that the
> ordering is preserved with ScalarArrayOps, right? You change B-tree to
> make that true, but what about any out-of-tree index AM extensions? I
> don't know if any such extensions exist, and I don't think we should
> jump through any hoops to preserve backwards compatibility here, but
> probably deserves a notice in the release notes if nothing else.

My interpretation is that the planner changes affect amcanorder +
amsearcharray index AMs, but have no impact on mere amsearcharray
index AMs. If anything this is a step *away* from knowing about nbtree
implementation details in the planner (though the planner's definition
of amcanorder is very close to the behavior from nbtree, down to
things like knowing all about nbtree strategy numbers). The planner
changes from the patch are all subtractive -- I'm removing kludges
that were added by bug fix commits. Things that weren't in the
original feature commit at all.

I used the term "my interpretation" here because it seems hard to
think of this in abstract terms, and to write a compatibility note for
this imaginary audience. I'm happy to go along with whatever you want,
though. Perhaps you can suggest a wording for this?

> - You use the term "primitive index scan" a lot, but it's not clear to
> me what it means. Does one ScalarArrayOps turn into one "primitive index
> scan"? Or each element in the array turns into a separate primitive
> index scan? Or something in between? Maybe add a new section to the
> README explain how that works.

The term primitive index scan refers to the thing that happens each
time _bt_first() is called -- with and without the patch. In other
words, it's what happens when pg_stat_all_indexes.idx_scan is
incremented.

You could argue that that's not quite the right thing to be focussing
on, with this new design. But it has precedent going for it. As I
said, it's the thing that pg_stat_all_indexes.idx_scan counts, which
is a pretty exact and tangible thing. So it's consistent with
historical practice, but also with what other index AMs do when
executing ScalarArrayOps non-natively.

> - _bt_preprocess_array_keys() is called for each btrescan(). It performs
> a lot of work like cmp function lookups and desconstructing and merging
> the arrays, even if none of the SAOP keys change in the rescan. That
> could make queries with nested loop joins like this slower than before:
> "select * from generate_series(1, 50) g, tenk1 WHERE g = tenk1.unique1
> and tenk1.two IN (1,2);".

But that's nothing new. _bt_preprocess_array_keys() isn't a new
function, and the way that it's called isn't new in any way.

That said, I certainly agree that we should be worried about any added
overhead in _bt_first for nested loop joins with an inner index scan.
In my experience that can be an important issue. I actually have a
TODO item about this already. It needs to be included in my work on
performance validation, on general principle.

> - nbtutils.c is pretty large now. Perhaps create a new file
> nbtpreprocesskeys.c or something?

Let me get back to you on this.

> - You and Matthias talked about an implicit state machine. I wonder if
> this could be refactored to have a more explicit state machine. The
> state transitions and interactions between _bt_checkkeys(),
> _bt_advance_array_keys() and friends feel complicated.

I agree that it's complicated. That's the main problem that the patch
has, by far. It used to be even more complicated, but it's hard to see
a way to make it a lot simpler at this point. If you can think of a
way to simplify it then I'll definitely give it a go.

Can you elaborate on "more explicit state machine"? That seems like it
could have value by adding more invariants, and making things a bit
more explicit in one or two areas. It could also help us to verify
that they hold from assertions. But that isn't really the same thing
as simplification. I wouldn't use that word, at least.

> > + <note>
> > + <para>
> > + Every time an index is searched, the index's
> > + <structname>pg_stat_all_indexes</structname>.<structfield>idx_scan</structfield>
> > + field is incremented. This usually happens once per index scan node
> > + execution, but might take place several times during execution of a scan
> > + that searches for multiple values together. Only queries that use certain
> > + <acronym>SQL</acronym> constructs to search for rows matching any value
> > + out of a list (or an array) of multiple scalar values are affected. See
> > + <xref linkend="functions-comparisons"/> for details.
> > + </para>
> > + </note>
> > +
>
> Is this true even without this patch? Maybe commit this separately.

Yes, it is. The patch doesn't actually change anything in this area.
However, something in this area is new: it's a bit weird (but still
perfectly consistent and logical) that the count shown by
pg_stat_all_indexes.idx_scan will now show a value that is often
influenced by low-level implementation details now. Things that are
fairly far removed from the SQL query now affect that count -- that
part is new. That's what I had in mind when I wrote this, FWIW.

> The "Only queries ..." sentence feels difficult. Maybe something like
> "For example, queries using IN (...) or = ANY(...) constructs.".

I'll get back to you on this.

> > * _bt_preprocess_keys treats each primitive scan as an independent piece of
> > * work. That structure pushes the responsibility for preprocessing that must
> > * work "across array keys" onto us. This division of labor makes sense once
> > * you consider that we're typically called no more than once per btrescan,
> > * whereas _bt_preprocess_keys is always called once per primitive index scan.
>
> "That structure ..." is a garden-path sentence. I kept parsing "that
> must work" as one unit, the same way as "that structure", and it didn't
> make sense. Took me many re-reads to parse it correctly. Now that I get
> it, it doesn't bother me anymore, but maybe it could be rephrased.

I'll do that ahead of the next revision.

> Is there _any_ situation where _bt_preprocess_array_keys() is called
> more than once per btrescan?

No. Note that we don't know the scan direction within
_bt_preprocess_array_keys(). We need a separate function to set up the
array keys to their initial positions (this is nothing new).

> > /*
> > * Look up the appropriate comparison operator in the opfamily.
> > *
> > * Note: it's possible that this would fail, if the opfamily is
> > * incomplete, but it seems quite unlikely that an opfamily would omit
> > * non-cross-type comparison operators for any datatype that it supports
> > * at all. ...
> > */
>
> I agree that's unlikely. I cannot come up with an example where you
> would have cross-type operators between A and B, but no same-type
> operators between B and B. For any real-world opfamily, that would be an
> omission you'd probably want to fix.
>
> Still I wonder if we could easily fall back if it doesn't exist? And
> maybe add a check in the 'opr_sanity' test for that.

I'll see about an opr_sanity test.

> In _bt_readpage():

> > if (!so->firstPage && !numArrayKeys && minoff < maxoff)
> > {
>
> It's sad to disable this optimization completely for array keys. It's
> actually a regression from current master, isn't it? There's no
> fundamental reason we couldn't do it for array keys so I think we should
> do it.

I'd say whether or not there's any kind of regression in this area is
quite ambiguous, though in a way that isn't really worth discussing
now. If it makes sense to extend something like this optimization to
array keys (or to add a roughly equivalent optimization), then we
should do it. Otherwise we shouldn't.

Note that the patch actually disables two distinct and independent
optimizations when the scan has array keys. Both of these were added
by recent commit e0b1ee17, but they are still independent things. They
are:

1. This skipping thing inside _bt_readpage, which you've highlighted.

This is only applied on the second or subsequent leaf page read by the
scan. Right now, in the case of a scan with array keys, that means the
second or subsequent page from the current primitive index scan --
which doesn't seem particularly principled to me.

I'd need to invent a heuristic that works with my design to adapt the
optimization. Plus I'd need to be able to invalidate the precheck
whenever the array keys advanced. And I'd probably need a way of
guessing whether or not it's likely that the arrays will advance,
ahead of time, so that the precheck doesn't almost always go to waste,
in a way that just doesn't make sense.

Note that all required scan keys are relevant here. I like to think of
plain required equality strategy scan keys without any array as
"degenerate single value arrays". Something similar can be said of
inequality strategy required scan keys (those required in the
*current* scan direction), too. So it's not as if I can "just do the
precheck stuff for the non-array scan keys". All required scan keys
are virtually the same thing as required array-type scan keys -- they
can trigger "roll over", affecting array key advancement for the scan
keys that are associated with arrays.

2. The optimization that has _bt_checkkeys skip non-required scan keys
that are *only* required in the direction *opposite* the current scan
direction -- this can work even without any precheck from
_bt_readpage.

Note that this second optimization relies on various behaviors in
_bt_first() that make it impossible for _bt_checkkeys() to ever see a
tuple that could fail to satisfy such a scan key -- we must always
have passed over non-matching tuples, thanks to _bt_first(). That
prevents my patch with a problem: the logic for determining whether or
not we need a new primitive index scan only promises to never require
the scan to grovel through many leaf pages that _bt_first() could and
should just skip over instead. This new logic makes no promises about
skipping over small numbers of tuples. So it's possible that
_bt_checkkeys() will see a handful of tuples "after the end of the
_bt_first-wise primitive index scan", but "before the _bt_first-wise
start of the next would-be primitive index scan".

Note that this stuff matters even without bringing optimization 2 into
it. There are similar considerations for required equality strategy
scan keys, which (by definition) must be required in both scan
directions. The new mechanism must never act as if it's past the end
of matches in the current scan direction, when in fact it's really
before the beginning of matches (that would lead to totally ignoring a
group of equal matches). The existing _bt_checkkeys() logic can't
really tell the difference on its own, since it only has an = operator
to work with (well, I guess it knows about this context, since there
is a comment about the dependency on _bt_first behaviors in
_bt_checkkeys on HEAD -- very old comments).

> _bt_checkkeys() is called in an assertion in _bt_readpage, but it has
> the side-effect of advancing the array keys. Side-effects from an
> assertion seems problematic.

I agree that that's a concern, but just to be clear: there are no
side-effects presently. You can't mix the array stuff with the
optimization stuff. We won't actually call _bt_checkkeys() in an
assertion when it can cause side-effects.

Assuming that we ultimately conclude that the optimizations *aren't*
worth preserving in any form, it might still be worth making it
obvious that the assertions have no side-effects. But that question is
unsettled right now.

Thanks for the review!

I'll try to get the next revision out soon. It'll also have bug fixes
for mark + restore and for a similar issue seen when the scan changes
direction in just the wrong way. (In short, the array key state
machine can be confused about scan direction in certain corner cases.)

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Richard Guo 2023-12-05 03:40:53 Re: Rename ShmemVariableCache and initialize it in more standard way
Previous Message Joe Conway 2023-12-05 02:54:50 Re: Emitting JSON to file using COPY TO