Re: Optimizing count(), but Explain estimates wildly off

From: Greg Sabino Mullane <htamfids(at)gmail(dot)com>
To: Chema <chema(at)interneta(dot)org>
Cc: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: Optimizing count(), but Explain estimates wildly off
Date: 2024-03-05 17:04:17
Message-ID: CAKAnmmJSstuCamBjd5N8o9Y-EzZHawZFC6ofvuA0nHAcw1pWWQ@mail.gmail.com
Lists: pgsql-performance

> columns has not improved the planner's estimates, which are off by almost
> 1M, and there's been no suggestion of what could cause that.

You are asking a lot of the planner - how would it know that the average
number of items is much higher for ids derived indirectly from "Mexico"
than for ids derived from "Colombia"?

One thing you could try, just as a general performance gain, is an
index-only scan, by creating an index like this:

create index tenders_date_country_id on tenders (country, "date") include (transaction_id);
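
To check whether it pays off, run the query under EXPLAIN after a fresh
VACUUM - index-only scans depend on the visibility map being current. The
WHERE values here are made up, so substitute your own:

vacuum analyze tenders;
explain (analyze, buffers)
select count(*) from tenders
where country = 'Mexico' and "date" >= '2023-01-01';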

>> Parallel Seq Scan on pricescope_items (cost=0.00..1027794.01 rows=3277101 width=522)
>>   (actual time=0.753..41654.507 rows=2621681 loops=3)
> Why does it take over 41 seconds to read a table with less than 3 million rows?

Good question. I still maintain it's because you are doing a 'select star'
on large, toasted rows.
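
You can get a feel for how much of that table lives out-of-line by comparing
the main heap to the table's total size (which adds TOAST plus the free
space and visibility maps):

select pg_size_pretty(pg_relation_size('pricescope_items')) as main_heap,
       pg_size_pretty(pg_table_size('pricescope_items')
                      - pg_relation_size('pricescope_items')) as toast_and_maps;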

I made two tables with approximately the same number of rows, and ran the
query. It returned a hash join containing:

-> Parallel Seq Scan on items (cost=0.00..69602.93 rows=3375592 width=8)
   (actual time=0.015..185.414 rows=2700407 loops=3)
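
For reference, the setup was along these lines (a sketch, not the exact
commands - column names are made up, and the row count was picked to match
the totals above):

create table items as
  select g as id, g as tender_transaction_id
  from generate_series(1, 8100000) g;  -- ~2.7M rows per worker x 3 workers
analyze items;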

Then I boosted the width by a lot by adding some filler text columns, and it
returned the same number of rows, but much slower:

-> Parallel Seq Scan on items (cost=0.00..1729664.15 rows=3370715 width=1562)
   (actual time=0.027..36693.986 rows=2700407 loops=3)
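
The widening was done with filler text columns - something in this spirit,
with the sizes being arbitrary:

alter table items
  add column filler1 text,
  add column filler2 text;
update items set
  filler1 = repeat('x', 750),
  filler2 = repeat('y', 750);
vacuum full analyze items;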

A second run with everything in cache was better, but still an order of
magnitude worse than the small-row case:

-> Parallel Seq Scan on items (cost=0.00..1729664.15 rows=3370715 width=1562)
   (actual time=0.063..1565.486 rows=2700407 loops=3)

Best of all was a "SELECT 1", which switched the entire plan to a much
faster merge join, resulting in:

-> Parallel Index Only Scan using items_tender_transaction_id_index on items
   (cost=0.43..101367.60 rows=3372717 width=4)
   (actual time=0.087..244.878 rows=2700407 loops=3)

Yours will be different, as I cannot exactly duplicate your schema or data
distribution, but give "SELECT 1" a try. This was on Postgres 16, FWIW,
with a default_statistics_target of 100.
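
Concretely, the idea is a change of this shape - the join and column names
here are guesses based on the index names above, so adapt to your schema:

-- wide: every column gets dragged through the join
explain analyze
select * from tenders t
join items i on i.tender_transaction_id = t.transaction_id
where t.country = 'Mexico';

-- narrow: the planner is free to use an index-only scan
explain analyze
select 1 from tenders t
join items i on i.tender_transaction_id = t.transaction_id
where t.country = 'Mexico';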

Cheers,
Greg
