Re: select count(distinct ...) is slower than select distinct in about 5x

From: jacket41142 <jacket41142(at)gmail(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: select count(distinct ...) is slower than select distinct in about 5x
Date: 2013-12-11 01:57:15
Message-ID: CAONnt+7DnM2meR27PWzfP0hdzL0KBRS3CFpMq0KPUqEiZbW7rA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

I have done another experiment for count(*) vs count(distinct ...), on same
table schema but with 10000000 rows now. And for this time, the postgres
version is 9.3.2 (9.3.2-1.pgdg12.4+1).
These two has same resulted query plan with same estimated cost, but
count(*) is straightly fast.

test=> explain analyze select count(*) from t1;
QUERY
PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=169247.92..169247.93 rows=1 width=0) (actual
time=37775.187..37775.188 rows=1 loops=1)
-> Seq Scan on t1 (cost=0.00..144247.94 rows=9999994 width=0) (actual
time=0.037..19303.022 rows=10000000 loops=1)
Total runtime: 37775.216 ms
(3 筆資料列)

時間: 37775.493 ms
test=> explain analyze select count(distinct col_int) from t1;
QUERY
PLAN
------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=169247.92..169247.93 rows=1 width=4) (actual
time=45883.192..45883.195 rows=1 loops=1)
-> Seq Scan on t1 (cost=0.00..144247.94 rows=9999994 width=4) (actual
time=0.037..19652.540 rows=10000000 loops=1)
Total runtime: 45883.224 ms
(3 筆資料列)

時間: 45883.473 ms

test=> select count(*) from t1;
count
----------
10000000
(1 筆資料列)

時間: 602.018 ms
test=> select count(*) from t1;
count
----------
10000000
(1 筆資料列)

時間: 598.291 ms
test=> select count(*) from t1;
count
----------
10000000
(1 筆資料列)

時間: 592.439 ms

test=> select count(distinct col_int) from t1;
count
-------
1025
(1 筆資料列)

時間: 10311.788 ms
test=> select count(distinct col_int) from t1;
count
-------
1025
(1 筆資料列)

時間: 7063.156 ms
test=> select count(distinct col_int) from t1;
count
-------
1025
(1 筆資料列)

時間: 6899.283 ms

I don't think count(*) also uses sort since it should not be needed. But
for the query planner, it seems it can not distinguish between these two
now.

regards,
jacket41142

2013/12/11 jacket41142 <jacket41142(at)gmail(dot)com>

> Thanks very much.
>
> I think another problem is that the cost estimation isn't good enough to
> reflex real cost. Since we can see, from "explain analyze ...",
> count(distinct ...) has smallest cost between the others, but since it uses
> sorts, the time complexity should be higher especially for large amount of
> rows.
>
> Also I think even if we can have multiple count() expressions, the
> optimizer should also be able to choose between use sort, HashAggregate or
> maybe something like linear aggregate if sorts are not needed or other
> methods if exist. Also this may be done as just one job for entire table of
> interested columns, or for each column separately.
>
> regards,
> jacket41142
>
>
> 2013/12/11 Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
>
>> On Tue, Dec 10, 2013 at 9:28 AM, jacket41142 <jacket41142(at)gmail(dot)com>wrote:
>>
>>
>>>
>>> test=> select distinct col_int from t1 group by col_int;
>>> Time: 1177.936 ms
>>>
>>> So the performance difference is not very large.
>>> But when I do that:
>>>
>>> test=> select count(distinct col_int) from t1;
>>> count
>>> -------
>>> 1025
>>> (1 row)
>>>
>>> Time: 7367.476 ms
>>>
>>
>>
>> count(distinct ...) always sorts, rather than using a hash, to do its
>> work. I don't think that there is any fundamental reason that it could not
>> be changed to allow it to use hashing, it just hasn't been done yet. It is
>> complicated by the fact that you can have multiple count() expressions in
>> the same query which demand sorting/grouping on different columns.
>>
>> Cheers,
>>
>> Jeff
>>
>
>

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Jeff Janes 2013-12-11 02:24:57 Re: select count(distinct ...) is slower than select distinct in about 5x
Previous Message jacket41142 2013-12-11 01:23:03 Re: select count(distinct ...) is slower than select distinct in about 5x