Re: weird execution plan

From: "Huang, Suya" <Suya(dot)Huang(at)au(dot)experian(dot)com>
To: David G Johnston <david(dot)g(dot)johnston(at)gmail(dot)com>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: weird execution plan
Date: 2014-09-15 00:39:39
Message-ID: D83E55F5F4D99B4A9B4C4E259E6227CD01EBF653@AUX1EXC02.apac.experian.local
Lists: pgsql-performance

-----Original Message-----
From: pgsql-performance-owner(at)postgresql(dot)org [mailto:pgsql-performance-owner(at)postgresql(dot)org] On Behalf Of David G Johnston
Sent: Saturday, September 13, 2014 7:34 AM
To: pgsql-performance(at)postgresql(dot)org
Subject: Re: [PERFORM] weird execution plan

>Not everyone does, so it's nice to make certain - especially since I'm not all that familiar with the code involved. But since no one else has answered, I will theorize.
>
>SELECT count(*) FROM ( SELECT DISTINCT col FROM tbl )
>
>vs
>
>SELECT count(DISTINCT col) FROM tbl
>
>The code for "SELECT DISTINCT col" is likely highly efficient because it works on complete sets of records.
>
>The code for "SELECT count(DISTINCT col)" is at a relative disadvantage since it must evaluate one row at a time and remember whether it had seen the same value previously before deciding whether to increment a counter.
>
>With a large number of duplicate rows, making the row set smaller before counting will perform better, since fewer rows must be evaluated in the less efficient count(DISTINCT) expression - the time saved there more than offsets the cost of effectively passing over that subset of the data a second time.
>
>HashAggregate(1M rows) + Aggregate(200k rows) < Aggregate(1M rows)
>
>David J.

Thanks David!

I am quite surprised by the findings you describe here. I just ran EXPLAIN ANALYZE on the example you gave and pasted the results below; you're correct.

"Select count(distinct col1)" is a very common SQL statement that we write daily. In Postgres, we need to rewrite it so that the aggregate doesn't run over a very large data set... I am wondering if this is something to be improved in the optimizer itself, instead of requiring developers to rewrite SQL - for example, having the optimizer de-duplicate first and just do the counting at the end. I used Oracle before and never saw this issue...

But really, thank you for pointing this out - a very valuable lesson learnt in writing SQL for PostgreSQL, for me and our developers.
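To summarize the rewrite pattern for our developers (using the wid column and terms_weekly table from the plans below; actual speedup will of course depend on how many duplicates the column has):

```sql
-- Slower form: the single aggregate must de-duplicate one row at a
-- time while counting, over the full ~42M-row input.
SELECT count(DISTINCT wid) FROM terms_weekly;

-- Faster form: a HashAggregate shrinks the input to its distinct
-- values first, then a cheap count(*) runs over that smaller set.
SELECT count(*) FROM (SELECT DISTINCT wid FROM terms_weekly) foo;
```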

dev=# explain analyze select count(*) from (select distinct wid from terms_weekly) foo;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1278656.00..1278656.01 rows=1 width=0) (actual time=24316.335..24316.336 rows=1 loops=1)
   ->  HashAggregate  (cost=1278651.50..1278653.50 rows=200 width=42) (actual time=23899.916..24242.010 rows=1298124 loops=1)
         ->  Append  (cost=0.00..1171738.20 rows=42765321 width=42) (actual time=0.028..13631.898 rows=42765320 loops=1)
               ->  Seq Scan on search_terms_weekly  (cost=0.00..0.00 rows=1 width=516) (actual time=0.001..0.001 rows=0 loops=1)
               ->  Seq Scan on search_terms_weekly_20140503  (cost=0.00..293352.90 rows=10702190 width=42) (actual time=0.026..2195.460 rows=10702190 loops=1)
               ->  Seq Scan on search_terms_weekly_20140510  (cost=0.00..298773.53 rows=10878953 width=42) (actual time=8.244..3163.087 rows=10878953 loops=1)
               ->  Seq Scan on search_terms_weekly_20140517  (cost=0.00..288321.17 rows=10537717 width=41) (actual time=7.345..2520.531 rows=10537717 loops=1)
               ->  Seq Scan on search_terms_weekly_20140524  (cost=0.00..291290.60 rows=10646460 width=41) (actual time=8.543..2693.833 rows=10646460 loops=1)
 Total runtime: 24333.830 ms
(9 rows)

dev=# explain analyze select count(distinct wid) from terms_weekly;

QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=1278651.50..1278651.51 rows=1 width=42) (actual time=585774.511..585774.511 rows=1 loops=1)
   ->  Append  (cost=0.00..1171738.20 rows=42765321 width=42) (actual time=0.019..10656.782 rows=42765320 loops=1)
         ->  Seq Scan on search_terms_weekly  (cost=0.00..0.00 rows=1 width=516) (actual time=0.002..0.002 rows=0 loops=1)
         ->  Seq Scan on search_terms_weekly_20140503  (cost=0.00..293352.90 rows=10702190 width=42) (actual time=0.017..2225.397 rows=10702190 loops=1)
         ->  Seq Scan on search_terms_weekly_20140510  (cost=0.00..298773.53 rows=10878953 width=42) (actual time=0.009..2244.918 rows=10878953 loops=1)
         ->  Seq Scan on search_terms_weekly_20140517  (cost=0.00..288321.17 rows=10537717 width=41) (actual time=0.008..1822.088 rows=10537717 loops=1)
         ->  Seq Scan on search_terms_weekly_20140524  (cost=0.00..291290.60 rows=10646460 width=41) (actual time=0.006..1561.229 rows=10646460 loops=1)
 Total runtime: 585774.568 ms
(8 rows)
