Quick Links

SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

From:	Dimitrios Apostolou <jimis(at)gmx(dot)net>
To:	pgsql-general(at)lists(dot)postgresql(dot)org
Subject:	SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
Date:	2024-05-10 18:38:07
Message-ID:	7886a68f-b466-2131-1747-f69f0fb71a37@gmx.net
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

Hello list,

INTRO

I have a huge (multi-billion rows) table partitioned into 1000 partitions.
Around half of the partitions are full and the rest are empty, created in
advance ready to receive future incoming data. Postgres is 16.2.
Here are the relevant parts of the schema:

Each partition is made to keep entries with workitem_n in ranges
(0,20k), (20k,40k) and so on (k = kilo) up to 20000k.

PROBLEM

I noticed that the following query is very very slow (too long to wait for
it to finish):

SELECT DISTINCT workitem_n FROM test_runs_raw ORDER BY workitem_n DESC LIMIT 10;

What is remarkable, is that in 998 out of 1000 table scans it involves,
the planner does not use the index. Instead it chooses a sequential scan.
Here is the output from EXPLAIN:

Limit (cost=853891608.79..853891608.99 rows=10 width=4)
-> Unique (cost=853891608.79..853891612.79 rows=200 width=4)
-> Sort (cost=853891608.79..853891610.79 rows=800 width=4) Sort Key: test_runs_raw.workitem_n DESC
-> Gather (cost=853891488.22..853891570.22 rows=800 width=4)
Workers Planned: 4
-> HashAggregate (cost=853890488.22..853890490.22 rows=200 width=4)
Group Key: test_runs_raw.workitem_n
-> Parallel Append (cost=0.00..813118117.30 rows=16308948365 width=4)
-> Parallel Index Only Scan Backward using test_runs_raw__part_max9600k_pkey on test_runs_raw__part_max9600k test_runs_raw_480 (cost=0.57..1597355.10 rows=33623320 width=4)
-> Parallel Index Only Scan Backward using test_runs_raw__part_max10140k_pkey on test_runs_raw__part_max10140k test_runs_raw_507 (cost=0.57..1210795.63 rows=25793672 width=4)
-> Parallel Seq Scan on test_runs_raw__part_max9500k test_runs_raw_475 (cost=0.00..3037793.12 rows=64121612 width=4)
-> Parallel Seq Scan on test_runs_raw__part_max11180k test_runs_raw_559 (cost=0.00..2918875.90 rows=61612190 width=4)
[ ... 996 more sequential scans ... ]

If I remove DISTINCT then the plan changes dramatically and it runs
instantaneously:

Limit (cost=363.84..367.30 rows=10 width=4)
-> Append (cost=363.84..22527480551.58 rows=65235793929 width=4)
-> Index Only Scan Backward using test_runs_raw__part_max20000k_pkey on test_runs_raw__part_max20000k test_runs_raw_1000 (cost=0.12..2.34 rows=1 width=4)
-> Index Only Scan Backward using test_runs_raw__part_max19980k_pkey on test_runs_raw__part_max19980k test_runs_raw_999 (cost=0.12..2.34 rows=1 width=4)
-> Index Only Scan Backward using test_runs_raw__part_max19960k_pkey on test_runs_raw__part_max19960k test_runs_raw_998 (cost=0.12..2.34 rows=1 width=4)
-> Index Only Scan Backward using test_runs_raw__part_max19940k_pkey on test_runs_raw__part_max19940k test_runs_raw_997 (cost=0.12..2.34 rows=1 width=4)
[ ... 996 more index scans ... ]

Notice how in the last plan there is no parallel scanning. Instead the
partitions are scanned sequentially, *in proper order*,
so that the plan execution stops after reading the first
10 rows in the first non-empty partition.

Why can't the same be done with DISTINCT?
Please note that the workitem_n value range is well spread into in range
(0,13M) and the table has been gradually filled within one year, so I'm
assuming the vacuum worker has worked long enough to build sane statistics
(not sure how to verify that).

REMARKS

1. I tried reproducing the problem on an artificial table with few
partitions and few values, but I couldn't. Both queries execute fast,
and the planner is always choosing a non-parallel index-only scan.

2. Among testing changes to various settings, I just noticed that setting
max_parallel_workers_per_gather to 0 (from the original value of 4)
fixes the issue! On the original huge table, disabling parallelism
actually makes the query infinitely faster and it returns within 1s! Is
this a bug in the planner?

Thank you,
Dimitris

Responses

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions at 2024-05-10 19:24:00 from Dimitrios Apostolou

Browse pgsql-general by date

	From	Date	Subject
Next Message	Dimitrios Apostolou	2024-05-10 19:24:00	Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
Previous Message	Daniel McKenzie	2024-05-10 12:11:42	Re: Unexpected data when subscribing to logical replication slot