Re: select count(distinct ...) is slower than select distinct in about 5x

From: jacket41142 <jacket41142(at)gmail(dot)com>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: select count(distinct ...) is slower than select distinct in about 5x
Date: 2013-12-12 03:22:37
Message-ID: CAONnt+6W2GS7BO+fry-2ayehWerVVeKv+THgGVByYsPUrrgwmw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

2013/12/11 Jeff Janes <jeff(dot)janes(at)gmail(dot)com>

> On Tuesday, December 10, 2013, jacket41142 wrote:
>
>> Thanks very much.
>>
>> I think another problem is that the cost estimation isn't good enough to
>> reflex real cost. Since we can see, from "explain analyze ...",
>> count(distinct ...) has smallest cost between the others, but since it uses
>> sorts, the time complexity should be higher especially for large amount of
>> rows.
>>
>
> That one is easy to explain. The cost estimate is not intended to be an
> absolute estimate, it is just an relative estimate to choose between
> alternatives. Since the current implementation of count(distinct ...)
> does not *have* any alternatives for that step in the process, there is no
> point in estimating a cost for it. So part of providing it with
> alternatives will have to be providing those cost estimates as well.
>

I got it. Thanks very much for explain.

>
>> Also I think even if we can have multiple count() expressions, the
>> optimizer should also be able to choose between use sort, HashAggregate or
>> maybe something like linear aggregate if sorts are not needed or other
>> methods if exist. Also this may be done as just one job for entire table of
>> interested columns, or for each column separately.
>>
>
> Right. I hope this gets fixed. It's been on my todo list for a while,
> but at the current rate of going through my todo list, it will takes a few
> decades to get to if it is left up to me....
>

Thanks very much for your effort. Also it's still good to know for me that
this problem will be fixed in future. :)
And so until now, if someone want to use count(distinct ...), he can use a
workaround like subquery if performance is a concern. (Of course, he also
needs to take care about NULL values as mentioned in
http://www.postgresql.org/message-id/flat/CAPNY-2Utce-c+kNTwsMCbAk58=9mYeAeViTXT9LO7r1k77jukw(at)mail(dot)gmail(dot)com#CAPNY-2Utce-c+kNTwsMCbAk58=9mYeAeViTXT9LO7r1k77jukw@mail.gmail.com
)

best regards,
jacket41142

In response to

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2013-12-12 03:39:57 Re: Debugging shared memory issues on CentOS
Previous Message Janek Sendrowski 2013-12-12 01:00:38 Re: ORDER BY using index, tsearch2