Re: Adding skip scan (including MDAM style range skip scan) to nbtree

From: Peter Geoghegan <pg(at)bowt(dot)ie>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-17 22:14:46
Message-ID: CAH2-Wz=E7XrkvscBN0U6V81NK3Q-dQOmivvbEsjG-zwEfDdFpg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Sep 16, 2024 at 6:05 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> I've been looking at this patch over the last couple of days, mostly doing
> some stress testing / benchmarking (hence the earlier report) and basic
> review.

Thanks for taking a look! Very helpful.

> I do have some initial review comments, and the testing produced
> some interesting regressions (not sure if those are the cases where
> skipscan can't really help, that Peter mentioned he needs to look into).

The one type of query that has clearly regressed in a way that's just
not acceptable is the kind where we waste CPU cycles during scans that
are truly hopeless. For example, I see a big regression on one of the
best cases for the Postgres 17 work, described here:

https://pganalyze.com/blog/5mins-postgres-17-faster-btree-index-scans#a-practical-example-3x-performance-improvement

Notably, these cases access exactly the same buffers/pages as before,
so this really isn't a matter of "doing too much skipping". The number
of buffers hit exactly matches what you'll see on Postgres 17. It's
just that we waste too many CPU cycles in code such as
_bt_advance_array_keys, uselessly maintaining skip arrays.
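
To make "truly hopeless" concrete, here's a sketch of the shape of
case I mean (hypothetical table and names, not necessarily the blog
post's exact test):

    create table t (a int8, b int8);  -- "a" is almost unique
    create index t_a_b_idx on t (a, b);

    -- The patch adds a skip array on "a" here. With ~1 row per distinct
    -- "a" value there is nothing useful to skip: the scan reads exactly
    -- the same pages as a full index scan would on Postgres 17, while
    -- paying extra CPU in _bt_advance_array_keys to maintain the array.
    select * from t where b = 42;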

I'm not suggesting that there won't be any gray area with these
regressions -- nothing like this will ever be that simple. But it
seems to me like I should go fix these obviously-not-okay cases next,
and then see where that leaves everything else, regressions-wise. That
seems likely to be the most efficient way of dealing with the
regressions. So I'll start there.

That said, I *would* be surprised if you found a regression in any
query that simply didn't receive any new scan key transformations in
the new preprocessing code, in places like _bt_decide_skipatts and
_bt_skip_preproc_shrink. I see that many of the queries that you're
using for your stress-tests "aren't really testing skip scan", in this
sense. But I'm hardly about to tell you that you shouldn't spend time
on such queries -- that approach just discovered a bug affecting
Postgres 17 (that was also surprising, but it still happened!). My
point is that it's worth being aware of which test queries actually
use skip arrays in the first place -- it might help you with your
testing. There are essentially no changes to _bt_advance_array_keys
that'll affect traditional SAOP arrays (with the sole exception of
changes made by
v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch, which
affect every kind of array in the same way).
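
To make that concrete, here's a minimal sketch (hypothetical table and
index) of the distinction I'm drawing:

    create table t (a int4, b int4);
    create index t_a_b_idx on t (a, b);

    -- Gets a skip array on "a", since the prefix column is omitted
    -- from the query qual -- this one really tests skip scan:
    select * from t where b = 42;

    -- No skip arrays at all (every key column has an equality qual),
    -- so this one "isn't really testing skip scan":
    select * from t where a = 1 and b = 42;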

> 1) v6-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch
>
> - I find the places that increment "nsearches" a bit random. Each AM
> does it in an entirely different place (at least it seems that way to me).
> Is there a way to make this a bit more consistent?

From a mechanical perspective there is nothing at all random about it:
we do this at precisely the same point that we currently call
pgstat_count_index_scan, which in each index AM maps to one descent of
the index. It is at least consistent. Whenever a B-Tree index scan
shows "Index Searches: N", you'll see precisely the same number by
swapping it with an equivalent contrib/btree_gist-based GiST index and
running the same query again (assuming that index tuples that match
the array keys are spread apart in both the B-Tree and GiST indexes).

(Though I see problems with the precise place that nbtree calls
pgstat_count_index_scan right now, at least in certain edge-cases,
which I discuss below in response to your questions about that.)

> uint64 btps_nsearches; /* instrumentation */
>
> Instrumentation of what? What's the counter for?

Will fix.

In case you missed it, there is another thread + CF Entry dedicated to
discussing this instrumentation patch:

https://commitfest.postgresql.org/49/5183/
https://www.postgresql.org/message-id/flat/CAH2-WzkRqvaqR2CTNqTZP0z6FuL4-3ED6eQB0yx38XBNj1v-4Q(at)mail(dot)gmail(dot)com

> - I see _bt_first moved the pgstat_count_index_scan, but doesn't that
> mean we skip it if the earlier code does "goto readcomplete"? Shouldn't
> that still count as an index scan?

In my opinion, no, it should not.

We're counting the number of times we'll have descended the tree using
_bt_search (or using _bt_endpoint, perhaps), which is a precisely
defined physical cost. A little like counting the number of buffers
accessed. I actually think that this aspect of how we call
pgstat_count_index_scan is a bug that should be fixed, with the fix
backpatched to Postgres 17. Right now, we see completely different
counts for a parallel index scan, compared to an equivalent serial
index scan -- differences that cannot be explained as minor
differences caused by parallel scan implementation details. I think
that it's just wrong right now, on master, since we're simply not
counting the thing that we're supposed to be counting (not reliably,
not if it's a parallel index scan).
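
The counter in question is what feeds the idx_scan column, so one way
to see the inconsistency for yourself (hypothetical index name) is:

    -- Run an equivalent query serially and in parallel, checking the
    -- counter after each; right now the parallel case reports a number
    -- that can't be explained by the scan's actual tree descents.
    select idx_scan from pg_stat_user_indexes
    where indexrelname = 't_a_b_idx';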

> - show_indexscan_nsearches does this:
>
> if (scanDesc && scanDesc->nsearches > 0)
> ExplainPropertyUInteger("Index Searches", NULL,
> scanDesc->nsearches, es);
>
> But shouldn't it divide the count by nloops, similar to (for example)
> show_instrumentation_count?

I can see arguments for and against doing it that way. It's
ambiguous/subjective, but on balance I favor not dividing by nloops.
You can make a similar argument for doing this with "Buffers: ", and
yet we don't divide by nloops there, either.

Honestly, I just want to find a way to do this that everybody can live
with. Better documentation could help here.

> 2) v6-0002-Normalize-nbtree-truncated-high-key-array-behavio.patch
>
> - Admittedly very subjective, but I find the "oppoDirCheck" abbreviation
> rather weird, I'd just call it "oppositeDirCheck".

Will fix.

> 3) v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch
>
> - nothing

Great. I think that I should be able to commit this one soon, since
it's independently useful work.

> 4) v6-0004-Add-skip-scan-to-nbtree.patch
>
> - indices.sgml seems to have a typo: "Intevening" -> "Intervening"
>
> - It doesn't seem like a good idea to remove the paragraph about
> multicolumn indexes and replace it with just:
>
> Multicolumn indexes should be used judiciously.
>
> I mean, what does judiciously even mean? What should the user consider
> to be judicious? Seems rather unclear to me. Admittedly, the old text
> was not very helpful, but at least it gave some advice.

Yeah, this definitely needs more work.

> But maybe more importantly, doesn't skipscan apply only to a rather
> limited subset of data types (that support increment/decrement)? Doesn't
> the new wording mostly ignore that, implying skipscan applies to all
> btree indexes? I don't think it mentions datatypes anywhere, but there
> are many indexes on data types like text, UUID and so on.

Actually, no, skip scan works in almost the same way with all data
types. Earlier versions of the patch didn't support every data type
(perhaps I should have waited for that before posting my v1), but the
version of the patch you looked at has no restrictions on any data
type.

You must be thinking of whether or not an opclass has skip support.
That's just an extra optimization, which can be used for a small
handful of discrete data types such as integer and date (hard to
imagine how skip support could ever be implemented for types like
numeric and text). There is a temporary testing GUC that will allow
you to get a sense of how much skip support can help: try "set
skipscan_skipsupport_enabled=off" with (say) my original MDAM test
query. You'll see more buffer hits needed for "next key probes",
though not dramatically more.

It's worth having skip support (the idea comes from the MDAM paper),
but it's not essential. Whether or not an opclass has skip support
isn't accounted for by the cost model, but I doubt that it's worth
addressing (the cost model is already pessimistic).

> - Very subjective nitpicking, but I find it a bit strange when a comment
> about a block is nested in the block, like in _bt_first() for the
> array->null_elem check.

Will fix.

> - assignProcTypes() claims providing skipscan for cross-type scenarios
> doesn't make sense. Why is that? I'm not saying the claim is wrong, but
> it's not clear to me why that would be the case.

It is just talking about the support function that skip scan can
optionally use, where it makes sense (skip support functions). The
relevant "else if (member->number == BTSKIPSUPPORT_PROC)" stanza is
largely copied from the existing nearby "else if (member->number ==
BTEQUALIMAGE_PROC)" stanza that was added for B-Tree deduplication. In
both stanzas we're talking about a capability that maps to a
particular "input opclass", which means the opclass that maps to the
datums that are stored on disk, in index tuples.

There are no restrictions on the use of skip scan with queries that
happen to involve the use of cross-type operators. It doesn't even
matter if we happen to be using an incomplete opfamily, since range
skip arrays never need to *directly* take the current array element
from a lower/upper bound inequality scan key's argument. It all
happens indirectly: code in places like _bt_first and _bt_checkkeys
can use inequalities (which are stored in BTArrayKeyInfo.low_compare
and BTArrayKeyInfo.high_compare) to locate the next matching on-disk
index tuple that satisfies the inequality in question. Obviously, the
located datum must be the same type as the one used by the array and
its scan key (it has to be the input opclass type if it's taken from
an index tuple).
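
Here's a sketch of the kind of cross-type case I mean (hypothetical
table and index):

    create table t (a int4, b int4);
    create index t_a_b_idx on t (a, b);

    -- "a > 5::int8" uses a cross-type operator (int4 vs. int8). It
    -- becomes the range skip array's low_compare. The array's current
    -- element is always an int4 taken from an on-disk index tuple --
    -- never taken directly from the int8 argument.
    select * from t where a > 5::int8 and b = 1;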

I think that it's a bit silly that nbtree generally bends over
backwards to find a way to execute a scan, given an incomplete
opfamily; in a green field situation it would make sense to just throw
an error instead. Even so, skip scan works in a way that is maximally
forgiving when incomplete opfamilies are used. Admittedly, it is just
about possible to come up with a scenario where we'll now throw an
error for a query that would have worked on Postgres 17. But that's no
different from what would happen if the query had an explicit
"= any( )" non-cross-type array instead of an implicit non-cross-type
skip array. The real problem in these scenarios is the lack of a
suitable cross-type ORDER proc (for a cross-type-operator query)
within _bt_first -- not the lack of cross-type operators. This issue
with missing ORDER procs just doesn't seem worth worrying about,
since, as I said, even slightly different queries (that don't use skip
scan) are bound to throw the same errors either way.

> Peter asked me to look at the costing, and I think it looks generally
> sensible.

I'm glad that you think that I basically have the right idea here.
Hard to know how to approach something like this, which doesn't have
any kind of precedent to draw on.

> We don't really have a lot of information to base the costing
> on in the first place - the whole point of skipscan is about multicolumn
> indexes, but none of the existing extended statistics seem very useful.
> We'd need some cross-column correlation info, or something like that.

Maybe, but that would just mean that we'd sometimes be more optimistic
about skip scan helping than we are with the current approach of
pessimistically assuming that there is no correlation at all. It's
not clear that being pessimistic in this sense is the wrong thing to
do, even though it's clearly less accurate on average.

> There's one thing that I don't quite understand, and that's how
> btcost_correlation() adjusts correlation for multicolumn indexes:
>
> if (index->nkeycolumns > 1)
> indexCorrelation = varCorrelation * 0.75;
>
> That seems fine for a two-column index, I guess. But shouldn't it
> compound for indexes with more keys? I mean, 0.75 * 0.75 for third
> column, etc? I don't think btcostestimate() does that, it just remembers
> whatever btcost_correlation() returns.

I don't know either. In general I'm out of my comfort zone here.

> The only alternative approach I can think of is not to adjust the
> costing for the index scan at all, and only use this to enable (or not
> enable) the skipscan internally. That would mean the overall plan
> remains the same, and maybe sometimes we would think an index scan would
> be too expensive and use something else. Not great, but it doesn't have
> the risk of regressions - IIUC we can disable the skipscan at runtime,
> if we realize it's not really helpful.

In general I would greatly prefer to not have a distinct kind of index
path for scans that use skip scan. I'm quite keen on a design that
allows the scan to adapt to unpredictable conditions at runtime.

Of course, that doesn't preclude passing the index scan a hint about
what's likely to work at runtime, based on information figured out
when costing the scan. Perhaps that will prove necessary to avoid
regressing index scans that are naturally quite cheap already -- scans
where we really need to have the right general idea from the start to
avoid any regressions. I'm not opposed to that, provided the index
scan has the ability to change its mind when (for whatever reason) the
guidance from the optimizer turns out to be wrong.

> As usual, I wrote a bash script to do a bit of stress testing. It
> generates tables with random data, and then runs random queries with
> random predicates on them, while mutating a couple of parameters (like
> number of workers) to trigger different plans. It does that on 16,
> master and with the skipscan patch (with the fix for parallel scans).

I wonder if some of the regressions you see can be tied to the use of
an LWLock in place of the existing spinlock. I did that because I
sometimes need to allocate memory to deserialize the array keys with
the exclusive lock held. It might be the case that a lot of these
regressions are tied to that, or to something else that is far from
obvious. I'll have to investigate.

In general, I haven't done much on parallel index scans here (I only
added support for them very recently), whereas your testing places a
lot of emphasis on parallel scans. Nothing wrong with that emphasis
(it caught that Postgres 17 bug), but I just want to put it in context.

> I've uploaded the script and results from the last run here:
>
> https://github.com/tvondra/pg-skip-scan-tests
>
> There's the "run-mdam.sh" script that generates tables/queries, runs
> them, collects all kinds of info about the query, and produces files
> with explain plans, CSV with timings, etc.

It'll take me a while to investigate all this data.

> Anyway, I ran a couple thousand such queries, and I haven't found any
> incorrect results (the script compares that between versions too). So
> that's good ;-)

That's good!

> But my main goal was to see how this affects performance. The tables
> were pretty small (just 1M rows, maybe ~80MB), but with restarts and
> dropping caches, large enough to test this.

The really compelling cases all tend to involve fairly selective index
scans. Obviously, skip scan can only save work by navigating the index
structure more efficiently (unlike loose index scan). So if the main
cost is inherently bound to be the cost of heap accesses, then we
shouldn't expect a big speed up.
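
For instance, the profile of a compelling case looks something like
this (hypothetical table and index):

    -- Few distinct values in the leading column, selective qual on the
    -- suffix column: one cheap index search per "region" value, rather
    -- than a scan of the entire index.
    create table t (region int4, id int8);
    create index t_region_id_idx on t (region, id);
    select * from t where id = 123456;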

> For example, one of the slowed down queries is query 702 (top of page 8
> in the PDF). The query is pretty simple:
>
> explain (analyze, timing off, buffers off)
> select id1,id2 from t_1000000_1000_1_2
> where NOT (id1 in (:list)) AND (id2 = :value);
>
> and it was executed on a table with random data in two columns, each
> with 1000 distinct values. This is perfectly random data, so a great
> match for the assumptions in costing etc.
>
> But with uncached data, this runs in ~50 ms on master, but takes almost
> 200 ms with skipscan (these timings are from my laptop, but similar to
> the results).

I'll need to investigate this specifically. That does seem odd.

FWIW, it's a pity that the patch doesn't know how to push down the NOT
IN () here. The MDAM paper contemplates such a scheme. We see the use
of filter quals here, when in principle this could work by using a
skip array that doesn't generate elements that appear in the NOT IN()
list (it'd generate every possible indexable value *except* the given
list/array values). The only reason that I haven't implemented this
yet is that I'm not at all sure how to make it work on the optimizer
side. The nbtree side of the implementation will probably be quite
straightforward, since it's really just a slight variant of a skip
array that excludes certain values.

> -- with skipscan
> Index Only Scan using t_1000000_1000_1_2_id1_id2_idx on
> t_1000000_1000_1_2 (cost=0.96..983.26 rows=1719 width=16)
> (actual rows=811 loops=1)
> Index Cond: (id2 = 997)
> Index Searches: 1007
> Filter: (id1 <> ALL ('{983,...,640}'::bigint[]))
> Rows Removed by Filter: 163
> Heap Fetches: 0
> Planning Time: 3.730 ms
> Execution Time: 238.554 ms
> (8 rows)
>
> I haven't looked into why this is happening, but this seems like a
> pretty good match for skipscan (on the first column). And for the
> costing too - it's perfectly random data, no correlation, etc.

I wonder what "Buffers: N" shows? That's usually the first thing I
look at (that and "Index Searches", which here looks the way you said
it should). But, yeah, let me get back to you on this.
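
For reference, that just means rerunning the query with BUFFERS, using
the same placeholders that your script uses:

    explain (analyze, timing off, buffers)
    select id1, id2 from t_1000000_1000_1_2
    where not (id1 in (:list)) and (id2 = :value);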

Thanks again!
--
Peter Geoghegan
