Re: Optimizing count(), but Explain estimates wildly off

From: Vitalii Tymchyshyn <vit(at)tym(dot)im>
To: Chema <chema(at)interneta(dot)org>
Cc: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Re: Optimizing count(), but Explain estimates wildly off
Date: 2024-02-27 04:36:13
Message-ID: CABWW-d3aP-yvhMVCoBGWQWan=189K8yTY42My70RK=pExWxKLA@mail.gmail.com
Lists: pgsql-performance

Is your transaction_id more or less monotonically increasing with the date?
If so, something like the following can help:

with tenders_filtered as (
    select * from tenders
    where country = 'Mexico'
      and "date" >= '2023-01-01' and "date" < '2024-01-01'
)
select * from tenders_filtered
inner join items on transaction_id = tender_transaction_id
where tender_transaction_id between
      (select min(transaction_id) from tenders_filtered)
  and (select max(transaction_id) from tenders_filtered);

This assumes you have an index on items(tender_transaction_id) and that the
BETWEEN range selects a small subset (say, less than 5%) of the table.
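The plan output quoted below suggests you already have such an index
(items_tender_transaction_id_index); if not, it would be something like:

create index items_tender_transaction_id_index
    on items (tender_transaction_id);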
If your transaction_id is not monotonic, consider introducing something that
is, or even adding denormalized column(s) with country and/or date to items.
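A minimal sketch of the denormalization idea, assuming items joins to
tenders as in your queries (the column and index names are just examples):

alter table items add column country text;

update items
set country = t.country
from tenders t
where t.transaction_id = items.tender_transaction_id;

create index items_country_index on items (country);

After that, queries filtering on country can hit items directly instead of
going through the join.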

Another option is to use a windowing function to get the count, e.g.
Select *, count(*) OVER () as cnt
from tenders
inner join items on transaction_id = tender_transaction_id
where country = 'Colombia'
  and "date" >= '2023-01-01' and "date" < '2024-01-01'

This would at least save you from issuing a second query just for the count.
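Note that the window count is computed over the full result set before any
LIMIT is applied, so it combines naturally with pagination; a sketch (page
size and ordering are just examples):

Select *, count(*) OVER () as cnt
from tenders
inner join items on transaction_id = tender_transaction_id
where country = 'Colombia'
  and "date" >= '2023-01-01' and "date" < '2024-01-01'
order by "date"
limit 50 offset 0;

Every returned row carries the total match count in cnt.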

On Mon, Feb 26, 2024 at 16:26 Chema <chema(at)interneta(dot)org> wrote:

> Dear pgsqlers,
>
> I'm trying to optimize simple queries on two tables (tenders & items) with
> a couple million records. Besides the resulting records, the app also
> displays the count of total results. Doing count() takes as much time as
> the other query (which can be 30+ secs), so it's an obvious target for
> optimization. I'm already caching count() results for the most common
> conditions (country & year) in a materialized table, which practically halves
> response time. The tables are updated sparingly, and only with bulk
> COPYs. Now I'm looking for ways to optimize queries with other conditions.
>
> Reading around, it seems many people are still using this 2005 snippet
> <https://www.postgresql.org/message-id/20050810133157.GA46247@winnie.fuhr.org>
> to obtain the row count estimate from Explain:
>
> CREATE FUNCTION count_estimate(query text) RETURNS integer AS $$
> DECLARE
>     rec record;
>     rows integer;
> BEGIN
>     FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
>         rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
>         EXIT WHEN rows IS NOT NULL;
>     END LOOP;
>     RETURN rows;
> END;
> $$ LANGUAGE plpgsql VOLATILE STRICT;
>
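For reference, such a helper is typically called with the whole query as a
string; dollar-quoting avoids escaping, and the filter here is just an
example:

select count_estimate($q$
    select * from tenders
    inner join items on transaction_id = tender_transaction_id
    where country = 'Mexico'
      and "date" >= '2023-01-01' and "date" < '2024-01-01'
$q$);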
> Is this still the current best practice? Any tips to increase precision?
> Currently it can over- or underestimate the actual number of rows by nearly
> a million, as seen in the sample queries (1,955,297 estimated vs. 1,001,200
> actual; 162,080 estimated vs. 1,292,010 actual).
>
> Any other tips to improve the query are welcome, of course. There's a big
> disparity between the two sample queries' plans even though only the
> filtered country changes.
>
> I already raised default_statistics_target up to 2k (the planner wasn't
> using indexes at all with low values). Gotta get it even higher? These are
> my custom settings:
>
> shared_buffers = 256MB # min 128kB
> work_mem = 128MB # min 64kB
> maintenance_work_mem = 254MB # min 1MB
> effective_cache_size = 2GB
> default_statistics_target = 2000
> random_page_cost = 1.0 # same scale as above
>
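Rather than raising default_statistics_target further for everything, you
can raise it only on the skewed columns the planner misestimates; a sketch,
assuming tenders.country is the culprit (the target value is just an
example; the cap is 10000):

alter table tenders alter column country set statistics 5000;
analyze tenders;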
> Sample query:
>
> Explain Analyze
> Select * from tenders
> inner join items on transaction_id = tender_transaction_id
> where country = 'Colombia'
> and "date" >= '2023-01-01' and "date" < '2024-01-01'
> QUERY PLAN
> Gather  (cost=253837.99..1506524.32 rows=1955297 width=823) (actual time=51433.592..63239.809 rows=1001200 loops=1)
>   Workers Planned: 2
>   Workers Launched: 2
>   ->  Parallel Hash Join  (cost=252837.99..1309994.62 rows=814707 width=823) (actual time=51361.920..61729.142 rows=333733 loops=3)
>         Hash Cond: (items.tender_transaction_id = tenders.transaction_id)
>         ->  Parallel Seq Scan on items  (cost=0.00..1048540.46 rows=3282346 width=522) (actual time=1.689..56887.108 rows=2621681 loops=3)
>         ->  Parallel Hash  (cost=247919.56..247919.56 rows=393475 width=301) (actual time=2137.473..2137.476 rows=333733 loops=3)
>               Buckets: 1048576  Batches: 1  Memory Usage: 219936kB
>               ->  Parallel Bitmap Heap Scan on tenders  (cost=16925.75..247919.56 rows=393475 width=301) (actual time=385.315..908.865 rows=333733 loops=3)
>                     Recheck Cond: ((country = 'Colombia'::text) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
>                     Heap Blocks: exact=24350
>                     ->  Bitmap Index Scan on tenders_country_and_date_index  (cost=0.00..16689.67 rows=944339 width=0) (actual time=423.213..423.214 rows=1001200 loops=1)
>                           Index Cond: ((country = 'Colombia'::text) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
> Planning Time: 12.784 ms
> JIT:
>   Functions: 33
>   Options: Inlining true, Optimization true, Expressions true, Deforming true
>   Timing: Generation 14.675 ms, Inlining 383.349 ms, Optimization 1023.521 ms, Emission 651.442 ms, Total 2072.987 ms
> Execution Time: 63378.033 ms
>
> Explain Analyze
> Select * from tenders
> inner join items on transaction_id = tender_transaction_id
> where country = 'Mexico'
> and "date" >= '2023-01-01' and "date" < '2024-01-01'
> QUERY PLAN
> Gather  (cost=1000.99..414258.70 rows=162080 width=823) (actual time=52.538..7006.128 rows=1292010 loops=1)
>   Workers Planned: 2
>   Workers Launched: 2
>   ->  Nested Loop  (cost=0.99..397050.70 rows=67533 width=823) (actual time=40.211..4087.081 rows=430670 loops=3)
>         ->  Parallel Index Scan using tenders_country_and_date_index on tenders  (cost=0.43..45299.83 rows=32616 width=301) (actual time=4.376..59.760 rows=1218 loops=3)
>               Index Cond: ((country = 'Mexico'::text) AND (date >= '2023-01-01'::date) AND (date < '2024-01-01'::date))
>         ->  Index Scan using items_tender_transaction_id_index on items  (cost=0.56..10.67 rows=11 width=522) (actual time=0.321..3.035 rows=353 loops=3655)
>               Index Cond: (tender_transaction_id = tenders.transaction_id)
> Planning Time: 7.808 ms
> JIT:
>   Functions: 27
>   Options: Inlining false, Optimization false, Expressions true, Deforming true
>   Timing: Generation 17.785 ms, Inlining 0.000 ms, Optimization 5.080 ms, Emission 93.274 ms, Total 116.138 ms
> Execution Time: 7239.427 ms
>
> Thanks in advance!
>
