From: | jacket41142 <jacket41142(at)gmail(dot)com> |
---|---|
To: | "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org> |
Subject: | Re: select count(distinct ...) is slower than select distinct in about 5x |
Date: | 2013-12-12 03:22:37 |
Message-ID: | CAONnt+6W2GS7BO+fry-2ayehWerVVeKv+THgGVByYsPUrrgwmw@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
2013/12/11 Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
> On Tuesday, December 10, 2013, jacket41142 wrote:
>
>> Thanks very much.
>>
>> I think another problem is that the cost estimation isn't good enough to
>> reflex real cost. Since we can see, from "explain analyze ...",
>> count(distinct ...) has smallest cost between the others, but since it uses
>> sorts, the time complexity should be higher especially for large amount of
>> rows.
>>
>
> That one is easy to explain. The cost estimate is not intended to be an
> absolute estimate, it is just an relative estimate to choose between
> alternatives. Since the current implementation of count(distinct ...)
> does not *have* any alternatives for that step in the process, there is no
> point in estimating a cost for it. So part of providing it with
> alternatives will have to be providing those cost estimates as well.
>
I got it. Thanks very much for explain.
>
>> Also I think even if we can have multiple count() expressions, the
>> optimizer should also be able to choose between use sort, HashAggregate or
>> maybe something like linear aggregate if sorts are not needed or other
>> methods if exist. Also this may be done as just one job for entire table of
>> interested columns, or for each column separately.
>>
>
> Right. I hope this gets fixed. It's been on my todo list for a while,
> but at the current rate of going through my todo list, it will takes a few
> decades to get to if it is left up to me....
>
Thanks very much for your effort. Also it's still good to know for me that
this problem will be fixed in future. :)
And so until now, if someone want to use count(distinct ...), he can use a
workaround like subquery if performance is a concern. (Of course, he also
needs to take care about NULL values as mentioned in
http://www.postgresql.org/message-id/flat/CAPNY-2Utce-c+kNTwsMCbAk58=9mYeAeViTXT9LO7r1k77jukw(at)mail(dot)gmail(dot)com#CAPNY-2Utce-c+kNTwsMCbAk58=9mYeAeViTXT9LO7r1k77jukw@mail.gmail.com
)
best regards,
jacket41142
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2013-12-12 03:39:57 | Re: Debugging shared memory issues on CentOS |
Previous Message | Janek Sendrowski | 2013-12-12 01:00:38 | Re: ORDER BY using index, tsearch2 |