From: | Roger Pack <rogerdpack2(at)gmail(dot)com> |
---|---|
To: | pgsql-general(at)postgresql(dot)org |
Subject: | count distinct slow? |
Date: | 2014-11-17 23:18:24 |
Message-ID: | CAL1QdWcJstuMdJ-NZ1C0LiDuTK48K6JH3LukMSEvzC86gXXGGQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Hello.
As a note, I ran into the following today (doing a select distinct is fast,
doing a count distinct is significantly slower?)
assume a table "issue" with a COLUMN nodename character varying(64);, 7.5M
rows...
select distinct substring(nodename from 1 for 9) from issue;
-- 5.8s
select count(distinct substring(nodename from 1 for 9)) from issue;
-- 190s
SELECT COUNT(*) FROM (SELECT DISTINCT substring(nodename from 1 for 9) FROM
issue) as temp;
-- 5.5s
I have an index on nodename's substring 1 for 9.
It struck me as odd that a count distinct would be far slower than
selecting distinct rows themselves. Apparently there are other workarounds
people have come up with as well [1]. Just mentioning in case it's helpful.
Cheers!
-roger-
[1]
http://stackoverflow.com/questions/11250253/postgresql-countdistinct-very-slow/14732410#14732410
explains:
explain analyze select count(distinct substring(nodename from 1 for 9))
from issue;
Aggregate (cost=222791.77..222791.78 rows=1 width=16) (actual
time=190641.069..190641.071 rows=1 loops=1)
-> Seq Scan on issue (cost=0.00..185321.51 rows=7494051 width=16)
(actual time=0.016..3487.694 rows=7495551 loops=1)
Total runtime: 190641.182 ms
explain analyze select distinct substring(nodename from 1 for 9) from
issue;
HashAggregate (cost=222791.77..222846.45 rows=4375 width=16) (actual
time=6276.578..6278.004 rows=6192 loops=1)
-> Seq Scan on issue (cost=0.00..204056.64 rows=7494051 width=16)
(actual time=0.058..4293.976 rows=7495551 loops=1)
Total runtime: 6278.564 ms
explain analyze SELECT COUNT(*) FROM (SELECT DISTINCT substring(nodename
from 1 for 9) FROM issue) as temp;
Aggregate (cost=222901.14..222901.15 rows=1 width=0) (actual
time=5195.025..5195.025 rows=1 loops=1)
-> HashAggregate (cost=222791.77..222846.45 rows=4375 width=16) (actual
time=5193.121..5194.454 rows=6192 loops=1)
-> Seq Scan on issue (cost=0.00..204056.64 rows=7494051 width=16)
(actual time=0.035..3402.834 rows=7495551 loops=1)
Total runtime: 5195.160 ms
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2014-11-18 01:26:24 | Re: count distinct slow? |
Previous Message | Roger Pack | 2014-11-17 23:03:37 | Fwd: ability to return number of rows inserted into child partition tables request |