Re: Optimizing count(), but Explain estimates wildly off

From: Chema <chema(at)interneta(dot)org>
To: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
Cc: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: Optimizing count(), but Explain estimates wildly off
Date: 2024-02-29 23:15:42
Message-ID: CALdEsqPEouM2tdgrYp--iLA8n7fJf85w5HtBRRqAAA=xBB_sHw@mail.gmail.com
Lists: pgsql-performance

> > Reading around, seems many people are still using this 2005 snippet to obtain the
> > row count estimate from Explain:
>
> I recommend using FORMAT JSON and extracting the top row count from that. It is
> simpler and less error-prone.
Good tip, thanks Laurenz!
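For the archives, a minimal sketch of the FORMAT JSON version (the function
name is mine, and the usual caveat applies: only feed it plain SELECTs):

create or replace function row_estimate(query text) returns bigint as $$
declare
    plan jsonb;
begin
    -- the top plan node's "Plan Rows" is the planner's estimate for the whole query
    execute 'EXPLAIN (FORMAT JSON) ' || query into plan;
    return (plan -> 0 -> 'Plan' ->> 'Plan Rows')::bigint;
end;
$$ language plpgsql;

-- usage: select row_estimate('select * from pricescope_tenders');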

> > Is this still the current best practice? Any tips to increase precision?
> > Currently it can estimate the actual number of rows for over or under a million,
> > as seen on the sample queries (1,955,297 instead of 1,001,200; 162,080 instead
> > of 1,292,010).
>
> Looking at the samples you provided, I get the impression that the statistics for
> the table are quite outdated. That will affect the estimates. Try running ANALYZE
> and see if that improves the estimates.
No major changes after running ANALYZE, and also VACUUM ANALYZE. Seems
something is seriously off. I pimped my config thanks to Alvaro's prompting:
set default_statistics_target = 500 (suggested for warehouse DBs) and raised
the targets on the pertinent columns from 2,000 to 5,000 (will play with
disabling JIT or raising its cost threshold later):

shared_buffers = 2GB              # ~0.25 * RAM; dedicated cache, hard allocation (requires restart)
effective_cache_size = 6GB        # 0.5-0.75 * RAM (free -h: free + cache + shared_buffers)
work_mem = 128MB                  # RAM * 0.25 / max_connections
maintenance_work_mem = 512MB
default_statistics_target = 500   # def 100; higher to make planner use indexes in big warehouse tables
random_page_cost = 1.1            # random reads on SSD cost almost as little as sequential ones
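
For reference, the per-column bump is plain ALTER TABLE, shown here on the two
columns the plans below filter on (new targets only kick in at the next ANALYZE):

alter table pricescope_tenders alter column country set statistics 5000;
alter table pricescope_tenders alter column "date" set statistics 5000;
analyze pricescope_tenders;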

Analyzed again (1.5M samples instead of 600k):

"tenders": scanned 216632 of 216632 pages, containing 3815567 live rows and 0 dead rows; 1500000 rows in sample, 3815567 estimated total rows
"items": scanned 995023 of 995023 pages, containing 7865043 live rows and 0 dead rows; 1500000 rows in sample, 7865043 estimated total rows
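
(Those sample sizes line up: ANALYZE samples 300 × the largest statistics
target on the table, so 300 × 2,000 = 600k before and 300 × 5,000 = 1.5M now.)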

but same deal:

-- After config pimp: 1,959,657 estimated vs 1,001,200 actual; 45,341.654 ms

Gather  (cost=247031.70..1479393.82 rows=1959657 width=824) (actual time=8464.691..45257.435 rows=1001200 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  ->  Parallel Hash Join  (cost=246031.70..1282428.12 rows=816524 width=824) (actual time=8413.057..44614.153 rows=333733 loops=3)
        Hash Cond: (pricescope_items.tender_transaction_id = pricescope_tenders.transaction_id)
        ->  Parallel Seq Scan on pricescope_items  (cost=0.00..1027794.01 rows=3277101 width=522) (actual time=0.753..41654.507 rows=2621681 loops=3)
        ->  Parallel Hash  (cost=241080.20..241080.20 rows=396120 width=302) (actual time=995.247..995.250 rows=333733 loops=3)
              Buckets: 1048576  Batches: 1  Memory Usage: 219904kB
              ->  Parallel Bitmap Heap Scan on pricescope_tenders  (cost=17516.10..241080.20 rows=396120 width=302) (actual time=162.898..321.472 rows=333733 loops=3)
                    Recheck Cond: ((country = 'Colombia'::text) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
                    Heap Blocks: exact=34722
                    ->  Bitmap Index Scan on pricescope_tenders_country_and_date_index  (cost=0.00..17278.43 rows=950688 width=0) (actual time=186.536..186.537 rows=1001200 loops=1)
                          Index Cond: ((country = 'Colombia'::text) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
Planning Time: 11.310 ms
JIT:
  Functions: 33
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 8.608 ms, Inlining 213.375 ms, Optimization 557.351 ms, Emission 417.568 ms, Total 1196.902 ms
Execution Time: 45341.654 ms
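
(Side note: the bitmap index scan's own estimate, 950,688, is not far off the
actual 1,001,200; it's the join estimate above it that balloons to 1.96M. The
per-column stats feeding these guesses can be eyeballed via pg_stats, e.g.:

select n_distinct, most_common_vals, most_common_freqs
from pg_stats
where tablename = 'pricescope_tenders' and attname = 'country';
)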

BUT if I force the planner to ignore 'country' statistics:

-- Subselect country to hide the constant from the planner, so it doesn't use statistics
Explain Analyze
Select * from pricescope_tenders
inner join pricescope_items on transaction_id = tender_transaction_id
where country = (select 'Colombia')
and "date" >= '2023-01-01' and "date" < '2024-01-01';

Then I get the same plan as if I filter for Mexico, with a similar run time:

-- Colombia in subselect: 428,623 estimated vs 1,001,200 actual; 6,674.860 ms

Gather  (cost=1001.00..570980.73 rows=428623 width=824) (actual time=166.785..6600.673 rows=1001200 loops=1)
  Workers Planned: 2
  Params Evaluated: $0
  Workers Launched: 2
  InitPlan 1 (returns $0)
    ->  Result  (cost=0.00..0.01 rows=1 width=32) (actual time=166.031..166.033 rows=1 loops=1)
  ->  Nested Loop  (cost=0.99..527118.42 rows=178593 width=824) (actual time=200.511..5921.585 rows=333733 loops=3)
        ->  Parallel Index Scan using pricescope_tenders_country_and_date_index on pricescope_tenders  (cost=0.43..104391.64 rows=86641 width=302) (actual time=200.388..400.882 rows=333733 loops=3)
              Index Cond: ((country = $0) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
        ->  Index Scan using pricescope_items_tender_transaction_id_index on pricescope_items  (cost=0.56..4.83 rows=5 width=522) (actual time=0.016..0.016 rows=1 loops=1001200)
              Index Cond: (tender_transaction_id = pricescope_tenders.transaction_id)
Planning Time: 7.372 ms
JIT:
  Functions: 31
  Options: Inlining true, Optimization true, Expressions true, Deforming true
  Timing: Generation 6.981 ms, Inlining 209.470 ms, Optimization 308.123 ms, Emission 248.176 ms, Total 772.750 ms
Execution Time: 6674.860 ms

So runtime is now decent; stats are still off by about -570k, though I guess
that's better than +1M.

1. Any tips to fix stats?
2. Or a better way of making the planner go for index scans on country? (One
untried idea sketched below.)
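
The untried idea: extended statistics, in case country and date are correlated
(the statistics name is made up; the mcv kind needs PG 12+):

create statistics tenders_country_date (ndistinct, dependencies, mcv)
    on country, "date" from pricescope_tenders;
analyze pricescope_tenders;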

Thanks again!
