Re: Adding skip scan (including MDAM style range skip scan) to nbtree

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Peter Geoghegan <pg(at)bowt(dot)ie>, Masahiro(dot)Ikeda(at)nttdata(dot)com
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-09-16 22:05:47
Message-ID: 411dad30-47f1-4404-a906-09d8bd94b83b@vondra.me

Hi,

I've been looking at this patch over the last couple of days, mostly
doing some stress testing / benchmarking (hence the earlier report) and
basic review. I do have some initial review comments, and the testing
produced some interesting regressions (not sure if those are the cases
where skipscan can't really help, which Peter mentioned he needs to look
into).

review
------

First, the review comments - nothing particularly serious, mostly just
cosmetic stuff:

1) v6-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch

- I find the places that increment "nsearches" a bit random. Each AM
does it in an entirely different place (at least it seems that way to
me). Is there a way to make this a bit more consistent?
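
For example, a trivial helper along these lines (entirely hypothetical,
the name is made up, and I'm assuming the counter lives on the scan
descriptor, which is what the explain code suggests) would at least
give every AM the same thing to call:

    /* hypothetical helper, just to illustrate the idea */
    static inline void
    index_scan_count_search(IndexScanDesc scan)
    {
        scan->nsearches++;
    }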

- I find this comment rather unhelpful:

    uint64      btps_nsearches; /* instrumentation */

Instrumentation of what? What's the counter for?

- I see _bt_first moved the pgstat_count_index_scan() call, but doesn't
that mean we skip it if the earlier code does "goto readcomplete"?
Shouldn't that still count as an index scan?
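
I'd have expected the counter to be bumped before any of the early
exits, i.e. something like this (just a sketch, not the actual patch
code - I'm using qual_ok as a stand-in for whatever triggers the
shortcut):

    /* sketch: count the scan first, so scans that bail out early
     * still show up in pg_stat_*_indexes */
    pgstat_count_index_scan(scan->indexRelation);

    if (!so->qual_ok)
        goto readcomplete;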

- show_indexscan_nsearches does this:

    if (scanDesc && scanDesc->nsearches > 0)
        ExplainPropertyUInteger("Index Searches", NULL,
                                scanDesc->nsearches, es);

But shouldn't it divide the count by nloops, similar to (for example)
show_instrumentation_count?
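
Roughly like this, I mean (assuming the instrumentation / nloops is
reachable from the planstate here, the way it is in the other show_*
functions):

    /* sketch: average the counter over loops, the way
     * show_instrumentation_count reports its counters */
    if (scanDesc && scanDesc->nsearches > 0 && planstate->instrument)
    {
        double      nloops = planstate->instrument->nloops;

        if (nloops > 0)
            ExplainPropertyFloat("Index Searches", NULL,
                                 scanDesc->nsearches / nloops, 0, es);
    }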

2) v6-0002-Normalize-nbtree-truncated-high-key-array-behavio.patch

- Admittedly very subjective, but I find the "oppoDirCheck" abbreviation
rather weird; I'd just call it "oppositeDirCheck".

3) v6-0003-Refactor-handling-of-nbtree-array-redundancies.patch

- nothing

4) v6-0004-Add-skip-scan-to-nbtree.patch

- indices.sgml seems to have a typo: "Intevening" -> "Intervening"

- It doesn't seem like a good idea to remove the paragraph about
multicolumn indexes and replace it with just:

Multicolumn indexes should be used judiciously.

I mean, what does "judiciously" even mean? What should the user consider
to be judicious? Seems rather unclear to me. Admittedly, the old text was
not very helpful either, but at least it gave some advice.

But maybe more importantly, doesn't skipscan apply only to a rather
limited subset of data types (that support increment/decrement)? Doesn't
the new wording mostly ignore that, implying skipscan applies to all
btree indexes? I don't think it mentions data types anywhere, but there
are many indexes on data types like text, UUID and so on.

- Very subjective nitpicking, but I find it a bit strange when a comment
about a block is nested in the block, like in _bt_first() for the
array->null_elem check.

- assignProcTypes() claims that providing skipscan for cross-type
scenarios doesn't make sense. Why is that? I'm not saying the claim is
wrong, but it's not clear to me why that would be the case.

costing
-------

Peter asked me to look at the costing, and I think it looks generally
sensible. We don't really have a lot of information to base the costing
on in the first place - the whole point of skipscan is about multicolumn
indexes, but none of the existing extended statistics seem very useful.
We'd need some cross-column correlation info, or something like that.

It's an interesting question - if we could collect some new statistics
for multicolumn indexes (say, by having a way to collect AM-specific
stats), what would we collect for skipscan?
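
Presumably something like per-prefix ndistinct estimates, since the
number of distinct prefixes is what drives the number of index searches
a skipscan has to do? Just to make the question concrete, a hypothetical
(entirely made up) stats struct:

    /* hypothetical AM-specific stats - ndistinct[k] would be the
     * estimated number of distinct combinations of the first k+1
     * key columns of the index */
    typedef struct BTPrefixStats
    {
        int         nkeys;
        double      ndistinct[FLEXIBLE_ARRAY_MEMBER];
    } BTPrefixStats;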

There's one thing that I don't quite understand, and that's how
btcost_correlation() adjusts correlation for multicolumn indexes:

    if (index->nkeycolumns > 1)
        indexCorrelation = varCorrelation * 0.75;

That seems fine for a two-column index, I guess. But shouldn't it
compound for indexes with more keys? I mean, 0.75 * 0.75 for the third
column, etc.? I don't think btcostestimate() does that, it just remembers
whatever btcost_correlation() returns.
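
That is, something along these lines (just to illustrate the
compounding - I'm not saying 0.75 is the right base):

    /* sketch: apply the 0.75 discount once per additional key column */
    if (index->nkeycolumns > 1)
        indexCorrelation = varCorrelation *
            pow(0.75, index->nkeycolumns - 1);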

Anyway, the overall costing approach seems sensible, I think. It assumes
things we assume in general (columns/keys are considered independent),
which may be problematic, but this is the best we can do.

The only alternative approach I can think of is not to adjust the
costing for the index scan at all, and only use this to enable (or not
enable) the skipscan internally. That would mean the overall plan
remains the same, and maybe sometimes we would think an index scan would
be too expensive and use something else. Not great, but it doesn't have
the risk of regressions - IIUC we can disable the skipscan at runtime,
if we realize it's not really helpful.

If we're concerned about regressions, I think this would be the way to
deal with them. Or at least it's the best idea I have.

testing
-------

As usual, I wrote a bash script to do a bit of stress testing. It
generates tables with random data, and then runs random queries with
random predicates on them, while mutating a couple of parameters (like
the number of workers) to trigger different plans. It does that on 16,
on master, and on master with the skipscan patch (with the fix for
parallel scans).

I've uploaded the script and results from the last run here:

https://github.com/tvondra/pg-skip-scan-tests

There's the "run-mdam.sh" script that generates tables/queries, runs
them, collects all kinds of info about the query, and produces files
with explain plans, CSV with timings, etc.

Not all of the queries end up using index scans - depending on the
predicates etc., it might have to use a seqscan. Or maybe it only uses
an index scan because it's forced to by the enable_* options, etc.

Anyway, I ran a couple thousand such queries, and I haven't found any
incorrect results (the script compares the results between versions
too). So that's good ;-)

But my main goal was to see how this affects performance. The tables
were pretty small (just 1M rows, maybe ~80MB), but with restarts and
dropping caches, large enough to test this.

And generally the results seem good. You can either inspect the CSV
with raw results (look at the script to understand what the fields
are), or check the attached PDF with a pivot table summarizing them.

As usual, there's a heatmap on the right side, comparing the results
for different versions (first "master/16" and then "skipscan/master").
Green means "speedup/good" and red means "regression/bad".

Most of the places are "white" (no change) or not very far from it, or
perhaps "green". But there's also a bunch of red results, which means
regressions (FWIW the PDF is filtered to only the queries that would
actually use the executed plans even without the GUCs).

Some of the red places are for very short queries - just a couple ms,
which means it can easily be random noise, or something like that. But
a couple of queries are much longer, and might deserve some
investigation.
The easiest way is to look at the "QID" column in the row, which
identifies the query in the "query" CSV. Then look into the results CSV
for IDs of the runs (in the first "SEQ" column), and find the details in
the "analyze" log, which has all the plans etc.

Alternatively, use the .ods in the git repository, which allows
drilling down to the results (from the pivot tables).

For example, one of the slowed down queries is query 702 (top of page 8
in the PDF). The query is pretty simple:

explain (analyze, timing off, buffers off)
select id1,id2 from t_1000000_1000_1_2
where NOT (id1 in (:list)) AND (id2 = :value);

and it was executed on a table with random data in two columns, each
with 1000 distinct values. This is perfectly random data, so a great
match for the assumptions in costing etc.

But with uncached data, this runs in ~50 ms on master and takes almost
200 ms with skipscan (these timings are from my laptop, but similar to
the results in the repository).

-- master
Index Only Scan using t_1000000_1000_1_2_id1_id2_idx on
    t_1000000_1000_1_2 (cost=0.96..20003.96 rows=1719 width=16)
    (actual rows=811 loops=1)
  Index Cond: (id2 = 997)
  Filter: (id1 <> ALL ('{983,...,640}'::bigint[]))
  Rows Removed by Filter: 163
  Heap Fetches: 0
Planning Time: 7.596 ms
Execution Time: 28.851 ms
(7 rows)

-- with skipscan
Index Only Scan using t_1000000_1000_1_2_id1_id2_idx on
    t_1000000_1000_1_2 (cost=0.96..983.26 rows=1719 width=16)
    (actual rows=811 loops=1)
  Index Cond: (id2 = 997)
  Index Searches: 1007
  Filter: (id1 <> ALL ('{983,...,640}'::bigint[]))
  Rows Removed by Filter: 163
  Heap Fetches: 0
Planning Time: 3.730 ms
Execution Time: 238.554 ms
(8 rows)

I haven't looked into why this is happening, but this seems like a
pretty good match for skipscan (on the first column). And for the
costing too - it's perfectly random data, no correlation, etc.

regards

--
Tomas Vondra

Attachment Content-Type Size
optimal.pdf application/pdf 320.1 KB
q702.sql application/sql 1.5 KB
run-mdam.sh application/x-shellscript 23.4 KB
