Re: select DISTINCT

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: pg noob <pgnube(at)gmail(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: select DISTINCT
Date: 2013-09-07 02:48:58
Message-ID: CAMkU=1y58XfcYeds9RMAHad_E=Y=N9mpgyBLiqqrw+0-_MsbJQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Friday, September 6, 2013, pg noob wrote:

>
> Hi all,
>
> I'm curious about some of the query estimates that I'm seeing with queries
> that use DISTINCT.
> I am using postgres 8.4.13
>
> I did a couple of quick tests, and found that PostgreSQL seems to do some
> expensive work to
> return DISTINCT rows. This is contrary to what I was expecting because I
> expected that
> if it knows that it is building a distinct result set that it would
> maintain
> some kind of ordering as it built the results to be able to quickly throw
> away duplicates without any real overhead to the query.
>

There is no cost-free way of doing that. You either have to sort, or
hash, or walk in index order, and none of those things are free.

...

> But it gets a bit stranger than that.
> According to the docs,
> “DISTINCT ON column will eliminate all duplicates in the specified column;
> this is equivalent to using GROUP BY column.”
>

DISTINCT ON is not the same thing as DISTINCT, which is not the same as
count(distinct ...)

>
> Note that it says that using distinct is EQUIVALENT to using GROUP BY.
> And yet, look at the execution time difference of DISTINCT vs. GROUP BY:
>
> db=# explain analyze select count(distinct(column1)) from table1;
> QUERY
> PLAN
>
> ------------------------------------------------------------------------------------------------------------------------------
> Aggregate (cost=631397.32..631397.33 rows=1 width=8) (actual
> time=18596.606..18596.607 rows=1 loops=1)
> -> Seq Scan on table1 (cost=0.00..591411.65 rows=15994265 width=8)
> (actual time=0.003..2391.644 rows=16151368 loops=1)
> Total runtime: 18596.631 ms
> (3 rows)
>
> db=# explain analyze select count(foo.column1) from (select column1 from
> table1 group by column1) as foo;
>

Compare that to:

explain analyze select count(*) from (select DISTINCT column1 from table1)
as foo;

>
> Any ideas why this is? Or what I am doing wrong?
>

The code that executes "count(distinct ....)" has never been taught how to
use hash aggregation, whereas DISTINCT and GROUP BY have been. It always
sorts, which is often much slower than a hash, and also often much less
memory efficient. I speculate that this is because the implementation of
count(distinct ...) is really ancient code and never been brought up to
date, either about hashing or about more thorough EXPLAIN estimates--I have
been meaning to dig into it but haven't gotten around to it yet.

You can get the increased performance by just spelling it in the more
verbose way, but be aware that count (distinct ...) deals with NULL
differently than select count(*) from (select DISTINCT...) does, so they
are not exactly identical.

Cheers,

Jeff

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Steve Atkins 2013-09-07 04:29:40 Constraint exclusion and overlapping range checks
Previous Message Kevin Grittner 2013-09-07 00:51:06 Re: select DISTINCT