Quick Links

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions

From:	Dimitrios Apostolou <jimis(at)gmx(dot)net>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	pgsql-general(at)lists(dot)postgresql(dot)org
Subject:	Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
Date:	2024-05-11 01:10:50
Message-ID:	6a2f3906-3d7a-6924-7403-8f77d57a18e4@gmx.net
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

On Fri, 10 May 2024, Tom Lane wrote:

> Dimitrios Apostolou <jimis(at)gmx(dot)net> writes:
>> Further digging into this simple query, if I force the non-parallel plan
>> by setting max_parallel_workers_per_gather TO 0, I see that the query
>> planner comes up with a cost much higher:
>
>> Limit (cost=363.84..1134528847.47 rows=10 width=4)
>> -> Unique (cost=363.84..22690570036.41 rows=200 width=4)
>> -> Append (cost=363.84..22527480551.58 rows=65235793929 width=4)
>> ...
>
>> The total cost on the 1st line (cost=363.84..1134528847.47) has a much
>> higher upper limit than the total cost when
>> max_parallel_workers_per_gather is 4 (cost=853891608.79..853891608.99).
>> This explains the planner's choice. But I wonder why the cost estimation
>> is so far away from reality.
>
> I'd say the blame lies with that (probably-default) estimate of
> just 200 distinct rows. That means the planner expects to have
> to read about 5% (10/200) of the tables to get the result, and
> that's making fast-start plans look bad.

Indeed that's an awful estimate, the table has more than 1M of unique
values in that column. Looking into pg_stat_user_tables, I can't see the
partitions having been vacuum'd or analyzed at all. I think they should
have been auto-analyzed, since they get a ton of INSERTs
(no deletes/updates though) and I have the default autovacuum settings.
Could it be that autovacuum starts, but never
finishes? I can't find something in the logs.

In any case, even after the planner decides to execute the terrible plan
with the parallel seqscans, why doesn't it finish right when it finds 10
distinct values?

>
> Possibly an explicit ANALYZE on the partitioned table would help.

Thanks, I'll save the ANALYZE as the last step; I feel it's a good
opportunity to figure out more details about how postgres works. Plus I
expect ANALYZE to last a couple of days, so I should first find quiet time
for that. :-)

Dimitris

In response to

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions at 2024-05-10 20:22:48 from Tom Lane

Responses

Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions at 2024-05-11 01:33:38 from Tom Lane
Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions at 2024-05-11 01:35:44 from David Rowley

Browse pgsql-general by date

	From	Date	Subject
Next Message	Tom Lane	2024-05-11 01:33:38	Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions
Previous Message	Tom Lane	2024-05-10 20:22:48	Re: SELECT DISTINCT chooses parallel seqscan instead of indexscan on huge table with 1000 partitions