Re: Adding skip scan (including MDAM style range skip scan) to nbtree

From: Masahiro Ikeda <ikedamsh(at)oss(dot)nttdata(dot)com>
To: Peter Geoghegan <pg(at)bowt(dot)ie>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Masahiro(dot)Ikeda(at)nttdata(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org, Masao(dot)Fujii(at)nttdata(dot)com
Subject: Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Date: 2024-11-19 08:30:56
Message-ID: 51d00219180323d121572e1f83ccde2a@oss.nttdata.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Apologies for the delayed response. I've confirmed that the costing is
significantly
improved for multicolumn indexes in the case I provided. Thanks!
https://www.postgresql.org/message-id/TYWPR01MB10982A413E0EC4088E78C0E11B1A62%40TYWPR01MB10982.jpnprd01.prod.outlook.com

I have some comments and questions regarding the v14 patch.

(1)

> v14-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch

As a user, I agree with adding a counter for skip scans
(though it might be better to reply to the different thread).

I believe this will help users determine whether the skip scan
optimization is
effective. If the ratio of (actual rows)/(index searches) becomes small,
it
suggests that many traversals are occurring, indicating that the skip
scan
optimization is not working efficiently. In such cases, users might
benefit
from replacing the multi-column index with an optimized leading column
index.

IIUC, why not add it to the documentation? It would clearly help users
understand how to tune their queries using the counter, and it would
also show that the counter is not just for developers.

(2)

> v14-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch

From the perspective of consistency, wouldn't it be better to align the
naming
between the EXPLAIN output and pg_stat_all_indexes.idx_scan, even though
the
documentation states they refer to the same concept?

I personally prefer something like "search" instead of "scan", as "scan"
is
commonly associated with node names like Index Scan and similar terms.
To maintain
consistency, how about renaming pg_stat_all_indexes.idx_scan to
pg_stat_all_indexes.idx_search?

(3)

> v14-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch

The counter should be added in blgetbitmap().

(4)

> v14-0001-Show-index-search-count-in-EXPLAIN-ANALYZE.patch
> doc/src/sgml/bloom.sgml

The below forgot "Index Searches: 1".

-&gt; Bitmap Index Scan on btreeidx2 (cost=0.00..12.04
rows=500 width=0) (never executed)
Index Cond: (i2 = 898732)
Planning Time: 0.491 ms
Execution Time: 0.055 ms
(10 rows)

Although we may not need to fix it, due to the support for skip scan,
the B-tree
index is now selected over the Bloom index in my environment.

(5)

> v14-0002-Add-skip-scan-to-nbtree.patch

Although I tested with various data types such as int, uuid, oid, and
others on my
local PC, I could only identify the regression case that you already
mentioned.

The case is that Index(Only)Scan is selected instead of SeqScan.
Although the performance
of the full index scan is almost the same as that of SeqScan, the skip
scan degrades by
a factor of 4, despite the number of accessed pages being almost the
same. The flamegraph
indicates that the ratio of _bt_advance_array_keys() and
_bt_tuple_before_array_skeys()
becomes high.

Although it's not an optimal solution and would only reduce the degree
of performance
degradation, how about introducing a threshold per page to switch from
skip scan to full
index scan? For example, switch to full index scan after
_bt_tuple_before_array_skeys() in
_bt_checkkeys() fails 30 times per page. While 30 is just an example, I
referenced the
following case, which seems close to the worst-case scenario. In that
case,
_bt_advance_array_keys() and _bt_tuple_before_array_skeys() are called
almost 333 times per
page (1,000,000 records / 2,636 pages), and 30 is about 1/10 of that
number.

# Example case (see test.sql)
# master: 51.612ms
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..10633.43 rows=1 width=8) (actual
time=0.160..51.612 rows=1 loops=1)
Output: id1, id2
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=4425
-> Parallel Seq Scan on public.t (cost=0.00..9633.33 rows=1
width=8) (actual time=31.728..48.378 rows=0 loops=3)
Output: id1, id2
Filter: (t.id2 = 100)
Rows Removed by Filter: 333333
Buffers: shared hit=4425
Worker 0: actual time=47.583..47.583 rows=0 loops=1
Buffers: shared hit=1415
Worker 1: actual time=47.568..47.569 rows=0 loops=1
Buffers: shared hit=1436
Settings: effective_cache_size = '7932MB', work_mem = '15MB'
Planning Time: 0.078 ms
Execution Time: 51.630 ms
(17 rows)

# master with v14patch: 199.328ms. 4x slower
SET skipscan_prefix_cols=32;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------
Index Only Scan using t_idx on public.t (cost=0.42..3535.75 rows=1
width=8) (actual time=0.090..199.328 rows=1 loops=1)
Output: id1, id2
Index Cond: (t.id2 = 100)
Index Searches: 1
Heap Fetches: 0
Buffers: shared hit=2736
Settings: effective_cache_size = '7932MB', work_mem = '15MB'
Planning Time: 0.069 ms
Execution Time: 199.348 ms
(9 rows)

# master with v14patch: 54.665ms.
SET skipscan_prefix_cols=0; // disable skip scan
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
Index Only Scan using t_idx on public.t (cost=0.42..3535.75 rows=1
width=8) (actual time=0.031..54.665 rows=1 loops=1)
Output: id1, id2
Index Cond: (t.id2 = 100)
Index Searches: 1
Heap Fetches: 0
Buffers: shared hit=2736
Settings: effective_cache_size = '7932MB', work_mem = '15MB'
Planning Time: 0.067 ms
Execution Time: 54.685 ms
(9 rows)

(6)

> v14-0002-Add-skip-scan-to-nbtree.patch
> src/backend/access/nbtree/nbtutils.c

+static int
+_bt_preprocess_num_array_keys(IndexScanDesc scan, BTSkipPreproc
*skipatts,
+ int *numSkipArrayKeys)
(snip)
+ int prev_numSkipArrayKeys = *numSkipArrayKeys;
+
+ /*
+ * Backfill skip arrays for any wholly omitted attributes prior to
+ * attno_inkey
+ */
+ while (attno_skip < attno_inkey)

Is it better to move prev_numSkipArrayKeys =*numSkipArrayKeys after the
while loop?
For example, the index below should return *numSkipArrayKeys = 0 instead
of 1
if the id3 type does not support eq_op.

* index: CREATE INDEX test_idx on TEST (id1 int, id2 int, id3 no_eq_op,
id4 int);
* query: SELECT * FROM test WHERE id4 = 10;

Regards,
--
Masahiro Ikeda
NTT DATA CORPORATION

Attachment Content-Type Size
test.sql text/plain 660 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alvaro Herrera 2024-11-19 08:45:43 Re: README.tuplock and SHARE lock
Previous Message Nisha Moond 2024-11-19 07:20:51 Re: Introduce XID age and inactive timeout based replication slot invalidation