Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, benoit <benoit(at)hopsandfork(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Date: 2024-01-23 20:22:58
Message-ID: CAH2-WznOdmwT+512h35XiWE3Fd-q6ueDbCD138eAiA2v+9o0sQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jan 22, 2024 at 4:13 PM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> On Fri, 19 Jan 2024 at 23:42, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> And this is where I disagree with your (percieved) implicit argument
> that this should be and always stay this way.

I never said that, and didn't intend to imply it. As I outlined to you
back in November, my general philosophy around APIs (such as the index
AM API) is that ambiguity about the limits and extent of an abstract
interface isn't necessarily a bad thing [1]. It can actually be a good
thing! Ever hear of Hyrum's Law? Abstractions are very often quite
leaky.

APIs like the index AM API inevitably make trade-offs. Trade-offs are
almost always political, in one way or another. Litigating every
possible question up-front requires knowing ~everything before you
really get started. This mostly ends up being a waste of time, since
many theoretical contentious trade-offs just won't matter one bit in
the long run, for one reason or another (not necessarily because there
is only ever one consumer of an API, for all sorts of reasons).

You don't have to agree with me. That's just what my experience
indicates works best on average, and in the long run. I cannot justify
it any further than that.

> By keeping that expectation alive,
> this becomes a self-fulfilling prophecy, and I really don't like such
> restrictive self-fulfilling prophecies. It's nice that we have index
> AM feature flags, but if we only effectively allow one implementation
> for this set of flags by ignoring potential other users when changing
> behavior, then what is the API good for (apart from abstraction
> layering, which is nice)?

I explicitly made it clear that I don't mean that.

> I think that may be right, but I could very well have built a
> btree-lite that doesn't have the historical baggage of having to deal
> with pages from before v12 (version 4) and some improvements that
> haven't made it to core yet.

Let me know if you ever do that. Let me know what the problems you
encounter are. I'm quite happy to revise my position on this in light
of new information. I change my mind all the time!

> Something like the following, maybe?
>
> """
> Compatibility: The planner will now generate paths with array scan
> keys in any column for ordered indexes, rather than on only a prefix
> of the index columns. The planner still expects fully ordered data
> from those indexes.
> Historically, the btree AM couldn't output correctly ordered data for
> suffix array scan keys, which was "fixed" by prohibiting the planner
> from generating array scan keys without an equality prefix scan key up
> to that attribute. In this commit, the issue has been fixed, and the
> planner restriction has thus been removed as the only in-core IndexAM
> that reports amcanorder now supports array scan keys on any column
> regardless of what prefix scan keys it has.
> """

Even if I put something like this in the commit message, I doubt that
Bruce would pick it up in anything like this form (I have my doubts
about it happening no matter what wording is used, actually).

I could include something less verbose, mentioning a theoretical risk
to out-of-core amcanorder routines that coevolved with nbtree,
inherited the same SAOP limitations, and then never got the same set
of fixes.

> patch-a is a trivial and clean implementation of mergesort, which
> tends to reduce the total number of compare operations if both arrays
> are of similar size and value range. It writes data directly back into
> the main array on non-assertion builds, and with assertions it reuses
> your binary search join on scratch space for algorithm validation.

I'll think about this some more.

> patch-b is an improved version of patch-a with reduced time complexity
> in some cases: It applies exponential search to reduce work done where
> there are large runs of unmatched values in either array, and does
> that while keeping O(n+m) worst-case complexity, now getting a
> best-case O(log(n)) complexity.

This patch seems massively over-engineered, though. Over 200 new lines
of code, all for a case that is only used when queries are written
with obviously-contradictory quals? It's just not worth it.

> > The recursive call to _bt_advance_array_keys is needed to deal with
> > unsatisfied-and-required inequalities that were not detected by an
> > initial call to _bt_check_compare, following a second retry call to
> > _bt_check_compare. We know that this recursive call to
> > _bt_advance_array_keys won't cannot recurse again because we've
> > already identified which specific inequality scan key it is that's a
> > still-unsatisfied inequality, following an initial round of array key
> > advancement.
>
> So, it's a case of this?
>
> scankey: a > 1 AND a = ANY (1 (=current), 2, 3)) AND b < 4 AND b = ANY
> (2 (=current), 3, 4)
> tuple: a=2, b=4

I don't understand why your example starts with "scankey: a > 1" and
uses redundant/contradictory scan keys (for both "a" and "b"). For a
forward scan the > scan key won't be seen as required by
_bt_advance_array_keys, which means that it cannot be relevant to the
branch containing the recursive call to _bt_advance_array_keys. (The
later branch that calls _bt_check_compare is another matter, but that
doesn't call _bt_advance_array_keys recursively at all -- that's not
what we're discussing.)

I also don't get why you've added all of this tricky redundancy to the
example you've proposed. That seems to make the example a lot more
complicated, without any apparent advantage. This particular piece of
code has nothing to do with redundant/contradictory scan keys.

> We first match the 'a' inequality key, then find an array-key mismatch
> breaking out, then move the array keys forward to a=2 (of ANY
> (1,2,3)), b=4 (of ANY(2, 3, 4)); and now we have to recheck the later
> b < 4 inequality key, as that required inequality key wasn't checked
> because the earlier arraykey did not match in _bt_check_compare,
> causing it to stop at the a=1 condition as opposed to check the b < 4
> condition?

I think that that's pretty close, yes.

Obviously, _bt_check_compare() is going to give up upon finding the
most significant now-unsatisfied scan key (which must be a required
scan key in cases relevant to the code in question) -- at that point
we need to advance the array keys. If we go on to successfully advance
all required arrays to values that all match the corresponding values
from caller's tuple, we might still have to consider inequalities
(that are required in the current scan direction).

It might still turn out that the relevant second call to
_bt_check_compare() (the first one inside _bt_advance_array_keys)
still sets continuescan=false -- despite what we were able to do with
the scan's array keys. At that point, the only possible explanation is
that there is a required inequality that still isn't satisfied. We
should use "beyond_end_advance" advancement to advance the array keys
"incrementally" a second time (in a recursive call "against the same
original tuple").

This can only happen when the value of each of two required scan keys
(some equality scan key plus some later inequality scan key) both
cease to be satisfied at the same time, *within the same tuple*. In
practice this just doesn't happen very often. It could in theory be
avoided altogether (without any behavioral change) by forcing
_bt_check_compare to always assess every scan key, rather than giving
up upon finding any unsatisfied scan key (though that would be very
inefficient).

It'll likely be much easier to see what I mean by considering a real
example. See my test case for this, which I don't currently plan to
commit:

https://github.com/petergeoghegan/postgres/blob/saop-dynamic-skip-v10.0/src/test/regress/sql/dynamic_saop_advancement.sql#L3600

I think that only this test case is the only one that'll actually
break when you just rip out the recursive _bt_advance_array_keys call.
And it still won't give incorrect answers when it breaks -- it just
accesses a single extra leaf page.

As I went into a bit already, upthread, this recursive call business
is a good example of making the implementation more complicated, in
order to preserve interface simplicity and generality. I think that
that's the right trade-off here, despite being kinda awkward.

> If so, then this visual explanation helped me understand the point and
> why it can't repeat more than once much better than all that text.
> Maybe this can be integrated?

An example seems like it'd be helpful as a code comment. Can you come
up with a simplified one?

> It was exactly why I started asking about the use of pointer
> additions: _bt_merge_arrays is new code and I can't think of any other
> case of this style being used in new code, except maybe when
> surrounding code has this style. The reason I first highlighted
> _bt_update_keys_with_arraykeys is because it had a clear case of 2
> different styles in the same function.

Meh.

[1] https://postgr.es/m/CAH2-WzmWn2_eS_4rWy90DRzC-NW-oponONR6PwMqy+OOuvVyFA@mail.gmail.com

--
Peter Geoghegan

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2024-01-23 20:25:20 Re: core dumps in auto_prewarm, tests succeed
Previous Message Robert Haas 2024-01-23 20:20:31 Re: Parallelize correlated subqueries that execute within each worker