Re: select DISTINCT

From: pg noob <pgnube(at)gmail(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: select DISTINCT
Date: 2013-09-09 14:10:35
Message-ID: CAPNY-2VeR=-RL0dzLnHqZ1ATe9jD2q53zfXrFFNXG2ig_3OXcQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Thank you Kevin and Jeff for the responses.
These are very helpful.

On Fri, Sep 6, 2013 at 10:48 PM, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Friday, September 6, 2013, pg noob wrote:
>
>>
>> Hi all,
>>
>> I'm curious about some of the query estimates that I'm seeing with
>> queries that use DISTINCT.
>> I am using postgres 8.4.13
>>
>> I did a couple of quick tests, and found that PostgreSQL seems to do some
>> expensive work to
>> return DISTINCT rows. This is contrary to what I was expecting because I
>> expected that
>> if it knows that it is building a distinct result set that it would
>> maintain
>> some kind of ordering as it built the results to be able to quickly throw
>> away duplicates without any real overhead to the query.
>>
>
> There is no cost-free way of doing that. You either have to sort, or
> hash, or walk in index order, and none of those things are free.
>
> ...
>
>
>> But it gets a bit stranger than that.
>> According to the docs,
>> “DISTINCT ON column will eliminate all duplicates in the specified
>> column; this is equivalent to using GROUP BY column.”
>>
>
> DISTINCT ON is not the same thing as DISTINCT, which is not the same as
> count(distinct ...)
>
>
>
>
>>
>> Note that it says that using distinct is EQUIVALENT to using GROUP BY.
>> And yet, look at the execution time difference of DISTINCT vs. GROUP BY:
>>
>> db=# explain analyze select count(distinct(column1)) from table1;
>> QUERY
>> PLAN
>>
>> ------------------------------------------------------------------------------------------------------------------------------
>> Aggregate (cost=631397.32..631397.33 rows=1 width=8) (actual
>> time=18596.606..18596.607 rows=1 loops=1)
>> -> Seq Scan on table1 (cost=0.00..591411.65 rows=15994265 width=8)
>> (actual time=0.003..2391.644 rows=16151368 loops=1)
>> Total runtime: 18596.631 ms
>> (3 rows)
>>
>> db=# explain analyze select count(foo.column1) from (select column1 from
>> table1 group by column1) as foo;
>>
>
> Compare that to:
>
> explain analyze select count(*) from (select DISTINCT column1 from table1)
> as foo;
>
>
>>
>> Any ideas why this is? Or what I am doing wrong?
>>
>
> The code that executes "count(distinct ....)" has never been taught how to
> use hash aggregation, whereas DISTINCT and GROUP BY have been. It always
> sorts, which is often much slower than a hash, and also often much less
> memory efficient. I speculate that this is because the implementation of
> count(distinct ...) is really ancient code and never been brought up to
> date, either about hashing or about more thorough EXPLAIN estimates--I have
> been meaning to dig into it but haven't gotten around to it yet.
>
> You can get the increased performance by just spelling it in the more
> verbose way, but be aware that count (distinct ...) deals with NULL
> differently than select count(*) from (select DISTINCT...) does, so they
> are not exactly identical.
>
> Cheers,
>
> Jeff
>

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Karsten Hilbert 2013-09-09 15:22:13 Re: Call for design: PostgreSQL mugs
Previous Message Chris Curvey 2013-09-09 14:00:36 help getting a backtrace from 9.2 on Ubuntu 13.04?