From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, benoit <benoit(at)hopsandfork(dot)com>, Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Optimizing nbtree ScalarArrayOp execution, allowing multi-column ordered scans, skip scan
Date: 2024-01-19 22:41:59
Message-ID: CAH2-Wzk0B0Ccso31cr1Eo8mfEE5jaggez6kD6WfmPhOQzhkSPA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 18, 2024 at 11:39 AM Matthias van de Meent
<boekewurm+postgres(at)gmail(dot)com> wrote:
> > I'm not going to break out the planner changes, because they're *not*
> > independent in any way.
>
> The planner changes depend on the btree changes, that I agree with.
> However, I don't think that the btree changes depend on the planner
> changes.

Yeah, technically it would be possible to break out the indxpath.c
changes -- that wouldn't leave the tree in a state with visibly-wrong
behavior at any point. But that's far from the only thing that
matters. If that was the standard that we applied, then I might have
to split the patch into 10 or more patches.

What it boils down to is this: it is totally natural for me to think
of the planner changes as integral to the nbtree changes, so I'm going
to include them together in one commit. That's just how the code was
written -- I always thought about it as one single thing. The majority
of the queries that I've used to promote the patch directly rely on
the planner changes in one way or another (even back in v1, when the
planner side of things had lots of kludges).

I don't necessarily always follow that standard. Sometimes it is
useful to further break things up. Like when it happens to make the
high-level division of labor a little bit clearer. The example that
comes to mind is commit dd299df8 and commit fab25024. These nbtree
commits were essentially one piece of work that was broken into two,
but committed within minutes of each other, and left the tree in a
momentary state that was not-very-useful (though still correct). That
made sense to me at the time because the code in question was
mechanically distinct, while at the same time modifying some of the
same nbtree files. Drawing a line under it seemed to make sense.

I will admit that there is some amount of subjective judgement (gasp!)
here. Plus I'll acknowledge that it's *slightly* odd that the most
compelling cases for the SAOP patch almost all involve savings in heap
page accesses, even though it is fundamentally a B-Tree patch. But
that's just how it came out. Slightly odd things happen all the time.

> I would agree with you if this was about new APIs and features, but
> here existing APIs are being repurposed without changing them. A
> maintainer of index AMs would not look at the commit title 'Enhance
> nbtree ScalarArrayOp execution' and think "oh, now I have to make sure
> my existing amsearcharray+amcanorder handling actually supports
> non-prefix array keys and returns data in index order".

This is getting ridiculous.

It is quite likely that there are exactly zero affected out-of-core
index AMs. I don't count pgroonga as a counterexample (I don't think
that it actually fulfills the definition of an amcanorder index AM).
Basically,
"amcanorder" index AMs more or less promise to be compatible with
nbtree, down to having the same strategy numbers. So the idea that I'm
going to upset amsearcharray+amcanorder index AM authors is a
completely theoretical problem. The planner code evolved with nbtree,
hand-in-glove.

I'm more than happy to add a reminder to the commit message about
adding a proforma listing to the compatibility section of the Postgres
17 release notes. Can we actually agree on which theoretical index AM
types are affected first, though? Isn't it actually
amsearcharray+amcanorder+amcanmulticol external index AMs only? Do I
have that right?

BTW, how should we phrase this compatibility note, so that it can be
understood? It's rather awkward.

> > As I said to Heikki, thinking about this some more is on my todo list.
> > I mean the way that this worked had substantial revisions on HEAD
> > right before I posted v9. v9 was only to fix the bit rot that that
> > caused.
>
> Then I'll be waiting for that TODO item to be resolved.

My TODO item is to resolve the question of whether and to what extent
these two optimizations should be combined. It's possible that I'll
decide that it isn't worth it, for whatever reason. At this point I'm
still totally neutral.

> > Even single-digit
> > thousands of array elements should be considered huge. Plus this code
> > path is only hit with poorly written queries.
>
> Would you accept suggestions?

You mean that you want to have a go at it yourself? Sure, go ahead.

I cannot promise that I'll accept your suggested revisions, but if
they aren't too invasive/complicated compared to what I have now, then
I will just accept them. I maintain that this isn't really necessary,
but it might be the path of least resistance at this point.

> > > Please fix this, as it shows up in profiling of large array merges.
> > > Additionally, as it merges two arrays of unique items into one,
> > > storing only matching entries, I feel that it is quite wasteful to do
> > > this additional allocation here. Why not reuse the original allocation
> > > immediately?
> >
> > Because that's adding more complexity for a code path that will hardly
> > ever be used in practice. For something that happens once, during a
> > preprocessing step.
>
> Or many times, when we're in a parameterized loop, as was also pointed
> out by Heikki.

That's not really what Heikki said. Heikki had a general concern about
the startup costs with nestloop joins.

> > > Additionally, could you try to create a single point of entry for the
> > > array key stuff that covers the new systems? I've been trying to wrap
> > > my head around this, and it's taking a lot of effort.
> >
> > I don't understand what you mean here.
>
> The documentation (currently mostly code comments) is extensive, but
> still spread across various inline comments and comments on functions,
> with no place where a clear top-level design is detailed.
> I'll agree that we don't have that for the general systems in
> _bt_next() either, but especially with this single large patch it's
> difficult to grasp the specific differences between the various
> functions.

Between what functions? The guts of this work are in the new
_bt_advance_array_keys and _bt_tuple_before_array_skeys (honorable
mention for _bt_array_keys_remain). It seems pretty well localized to
me.

Granted, there are a few places where we rely on things being set up a
certain way by other code that's quite some distance away (from the
code doing the depending). For example, the new additions to
_bt_preprocess_keys that we need later on, in
_bt_advance_array_keys/_bt_update_keys_with_arraykeys. These bits and
pieces of code are required, but are not particularly crucial to
understanding the general design.

For the most part the design is that we cede control of array key
advancement to _bt_checkkeys() -- nothing much else changes. There are
certainly some tricky details behind the scenes, which we should be
verifying via testing and especially via robust invariants (verified
with assertions). But this is almost completely hidden from the
nbtsearch.c caller -- there are no real changes required there.
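
To give a feel for that division of labor, here is a deliberately
tiny standalone sketch (invented names and types, nothing like the
real nbtree signatures) of a scan loop that cedes all array key
advancement to its checkkeys helper:

#include <stdbool.h>
#include <stdio.h>

/*
 * Hypothetical toy, not nbtree code: scan sorted "index" values and
 * match them against a sorted array of equality keys.  The caller's
 * loop never touches the array cursor; checkkeys() advances it.
 */
#define NKEYS 3
static const int arraykeys[NKEYS] = {2, 5, 9};
static int cur = 0;             /* current array key position */

static bool
checkkeys(int tupval, bool *continuescan)
{
    /* Advance the array keys to the maximum extent possible, based
     * on what the caller's tuple tells us about the scan's progress */
    while (cur < NKEYS && arraykeys[cur] < tupval)
        cur++;
    if (cur == NKEYS)
    {
        *continuescan = false;  /* ran out of array keys; end scan */
        return false;
    }
    return arraykeys[cur] == tupval;
}

int
main(void)
{
    const int index[] = {1, 2, 3, 5, 7, 9, 11};
    bool continuescan = true;

    for (int i = 0; i < 7 && continuescan; i++)
    {
        if (checkkeys(index[i], &continuescan))
            printf("match: %d\n", index[i]);    /* prints 2, 5, 9 */
    }
    return 0;
}

The caller only honors continuescan; where the arrays currently are
is entirely checkkeys' business.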

> It wasn't really meant as a restatement: It is always unsafe to ignore
> upper bits, as the index isn't organized by that. However, it *could*
> be safe to ignore the bits with lowest significance, as the index
> would still be organized correctly even in that case, for int4 at
> least. Similar to how you can have required scankeys for the prefix of
> an (int2, int2) index, but not the suffix (as of right now at least).

It isn't safe to assume that types appearing together within an
opfamily are binary coercible (or capable of being type-punned like
this) in the general case. That often works, but it isn't reliable.
Even with integer_ops, it breaks on big-endian machines.
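
To make the hazard concrete, here's a minimal self-contained C
sketch (not Postgres code) of why reinterpreting the leading bytes
of a wider integer breaks on big-endian machines:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int
main(void)
{
    int64_t big = 42;           /* fits in int32, stored as 8 bytes */
    int32_t low;

    /* Reinterpret the first 4 bytes, as a type-punning shortcut
     * might be tempted to do */
    memcpy(&low, &big, sizeof(low));

    /*
     * Little-endian: prints 42 (the low-order word comes first).
     * Big-endian: prints 0 (the first 4 bytes are the high-order
     * word), so comparisons against "low" no longer agree with the
     * ordering of the original int8 values.
     */
    printf("%d\n", (int) low);
    return 0;
}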

> This won't ever be an issue when there is a requirement that if a = b
> and b = c then a = c must hold for all triples of typed values and
> operations in a btree opclass, as you mention above.

Right. It also doesn't matter if there are redundant equality
conditions that we cannot formally prove are redundant during
preprocessing, for want of appropriate cross-type comparison routines
from the opfamily. The actual behavior of index scans (in terms of
when and how we skip) won't be affected by that at all. The only
problem that it creates is that we'll waste CPU cycles, relative to
the case where we can somehow detect this redundancy.

In case it wasn't clear before: the optimizations I've added to
_bt_preprocess_array_keys are intended to compensate for the
pessimizations added to _bt_preprocess_keys (both the optimizations
and the pessimizations only affect equality type array scan keys). I
don't think that this needs to be perfect; it just seems like a good
idea.

> > > I'm also no fan of the (tail) recursion. I would agree that this is
> > > unlikely to consume a lot of stack, but it does consume stackspace
> > > nonetheless, and I'd prefer if it was not done this way.
> >
> > I disagree. The amount of stack space it consumes in the worst case is fixed.
>
> Is it fixed? Without digging very deep into it, it looks like it can
> iterate on the order of n_scankeys deep? The comment above does hint
> at 2 iterations, but is not very clear about the conditions and why.

The recursive call to _bt_advance_array_keys is needed to deal with
unsatisfied-and-required inequalities that were not detected by an
initial call to _bt_check_compare, following a second retry call to
_bt_check_compare. We know that this recursive call to
_bt_advance_array_keys cannot recurse again because we've already
identified which specific inequality scan key is still unsatisfied,
following an initial round of array key advancement.

We're pretty much instructing _bt_advance_array_keys to perform
"beyond_end_advance" type advancement for the specific known
unsatisfied inequality scan key (which must be required in the current
scan direction, per the assertions for that) here. So clearly it
cannot end up recursing any further -- the recursive call is
conditioned on "all_arraylike_sk_satisfied && arrays_advanced", but
that'll be unset when "ikey == sktrig && !array" holds inside the loop.
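
If it helps, the recursion structure can be modeled with a toy like
this (names invented for the sketch, not taken from the patch):

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model of the bounded recursion.  The first call is triggered
 * by an array scan key; the recursive call always names a specific
 * non-array inequality key as its trigger, so its own recursion
 * condition can never fire.
 */
static int
advance_toy(bool trigger_is_array_key, int depth)
{
    assert(depth <= 2);         /* the invariant being argued for */

    /* Arrays only advance for an array key trigger, and only then
     * can the retry comparison leave a required inequality
     * unsatisfied (modeled as always-true here, the worst case) */
    bool arrays_advanced = trigger_is_array_key;
    bool unsat_required_inequality = trigger_is_array_key;

    if (arrays_advanced && unsat_required_inequality)
        return advance_toy(false, depth + 1);   /* recurse exactly once */

    return depth;
}

int
main(void)
{
    printf("max recursion depth: %d\n", advance_toy(true, 1)); /* 2 */
    return 0;
}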

This is a narrow edge-case -- it is an awkward one. Recursion does
seem natural here, because we're essentially repeating the call to
_bt_advance_array_keys from inside _bt_advance_array_keys, rather than
leaving it up to the usual external caller to do later on. If you
ripped this code out it would barely be noticeable at all. But it
seems worth it so that we can make a uniform promise to always advance
the array keys to the maximum extent possible, based on what the
caller's tuple tells us about the progress of the scan.

Since all calls to _bt_advance_array_keys are guaranteed to advance
the array keys to the maximum extent that's safely possible (barring
this one edge-case with recursive calls), it almost follows that there
can't be another recursive call. This is the one edge case where the
implementation requires a second pass over the tuple -- we advanced
the array keys to all-matching values, but nevertheless couldn't
finish there due to the unsatisfied required inequality (which must
have become unsatisfied _at the same time_ as some earlier required
equality scan key for us to end up requiring a recursive call).

> > Because in one case we follow the convention for scan keys, whereas
> > so->orderProcs is accessed via an attribute number subscript.
>
> Okay, but how about this in _bt_merge_arrays?
>
> + Datum *elem = elems_orig + i;
>
> I'm not familiar with the scan key convention, as most other places
> use reference+subscripting.

I meant the convention used in code like _bt_check_compare (which is
what we call _bt_checkkeys on HEAD, basically).

Note that the _bt_merge_arrays code that you've highlighted isn't
iterating through so->keyData[] -- it is iterating through the
function caller's elements array, which actually comes from
so->arrayKeys[].
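
For what it's worth, the two spellings are interchangeable anyway; a
trivial standalone demonstration (with Datum stubbed out, so this
isn't real Postgres code):

#include <stdint.h>
#include <stdio.h>

typedef uintptr_t Datum;        /* stand-in for the real typedef */

int
main(void)
{
    Datum elems_orig[] = {10, 20, 30};
    int i = 1;

    Datum *a = elems_orig + i;  /* pointer-arithmetic spelling */
    Datum *b = &elems_orig[i];  /* reference + subscript spelling */

    /* Identical addresses, identical values; it's style only */
    printf("%p %p %ju %ju\n", (void *) a, (void *) b,
           (uintmax_t) *a, (uintmax_t) *b);
    return 0;
}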

Like every other Postgres contributor, I do my best to follow the
conventions established by existing code. Sometimes that leads to
pretty awkward results, where CamelCase and underscore styles are
closely mixed together, because it works out to be the most consistent
way of doing it overall.

--
Peter Geoghegan
