Re: Please help! Query jumps from 1s -> 4m

From: Justin Pryzby <pryzby(at)telsasoft(dot)com>
To: James Thompson <james(at)jthompson(dot)dev>
Cc: pgsql-performance(at)postgresql(dot)org, laurenz(dot)albe(at)cybertec(dot)at, dgrowleyml(at)gmail(dot)com
Subject: Re: Please help! Query jumps from 1s -> 4m
Date: 2020-05-05 21:35:54
Message-ID: 20200505213553.GF28974@telsasoft.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

On Tue, May 05, 2020 at 10:10:18PM +0100, James Thompson wrote:
> I've managed to replicate this now with prepared statements. Thanks for all
> the guidance so far.
>
> The slowness occurs when the prepared statement changes to a generic plan.
>
> Initial plan:
> -> Index Only Scan using table1_typea_include_uniqueid_col16_idx on table1 table1alias2 (cost=0.56..2549.70 rows=70 width=36) (actual time=1.901..45.256 rows=65000 loops=1)
> Output: table1alias2.uniqueid
> Index Cond: ((table1alias2.col20 = '12345'::bigint) AND (table1alias2.col8 = ANY ('{c5986b02-3a02-4639-8147-f286972413ba,...98ed24b1-76f5-4b0e-bb94-86cf13a4809c}'::text[])))

The rowcount is off by a factor of 1000x.

> after 5 executions of the statement:
> -> Index Only Scan using table1_typea_include_uniqueid_col16_idx on table1 table1alias2 (cost=0.56..17.23 rows=1 width=36) (actual time=125.344..126877.822 rows=65000 loops=1)
> Output: table1alias2.uniqueid
> Index Cond: (table1alias2.col20 = $1001)
> Filter: ((table1alias2.col8)::text = ANY ((ARRAY[$1, ...,$1000])::text[]))
> Rows Removed by Filter: 2670023

And the generic plan is cheaper than the previous, custom plan; but slower,
undoubtedly due to rowcount mis-estimate.

> The second plan looks worse to me as it's applying a filter rather than
> using an index condition? I don't understand why it's not part of the
> condition and also why this is so much slower though.
> If I force a retrieval of all index rows for col20 = '12345' using an
> ad-hoc query (below, which in my mind is what the 'bad' plan is doing),
> that only takes 2s (2.7 mil rows). Where's the difference?
>
> EXPLAIN (ANALYZE, BUFFERS, TIMING) SELECT COUNT(DISTINCT id) FROM table1
> WHERE datatype='TypeA' AND col20 = 12345;
> -> Index Only Scan using table1_typea_include_uniqueid_col16_idx on table1
> (cost=0.56..2762.95 rows=55337 width=8) (actual time=3.976..1655.645 rows=2735023 loops=1)

I see you're querying on datetype, which I think isn't in the original query
(but I'm not sure if it's just renamed?).

Underestimate usually means that the conditions are redundant or correlated.
You could mitigate that either by creating an index on both columns (or add
datatype to the existing index), or CREATE STATISTICS ndistinct on those
columns. Or maybe you just need to increase the stats target for col20 (?).
Then re-analyze.

--
Justin

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2020-05-05 21:42:48 Re: Please help! Query jumps from 1s -> 4m
Previous Message James Thompson 2020-05-05 21:10:18 Re: Please help! Query jumps from 1s -> 4m