select count(distinct ...) is slower than select distinct in about 5x

From: jacket41142 <jacket41142(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: select count(distinct ...) is slower than select distinct in about 5x
Date: 2013-12-06 02:10:26
Message-ID: CAONnt+6Xz5YPaCxFp-js7BkAbjxWvRHQ9XdYto5baumOvQhDUw@mail.gmail.com

Hi

I'm just trying out PostgreSQL. I created a database "test" with a table
"t1":

test=> \d t1
                            Table "public.t1"
 Column  |  Type   |                      Modifiers
---------+---------+-----------------------------------------------------
 col_id  | integer | not null default nextval('t1_col_id_seq'::regclass)
 col_int | integer |
Indexes:
    "t1_pkey" PRIMARY KEY, btree (col_id)
    "t1_col_int_idx" btree (col_int)

There are 10001000 rows in that table, and col_int of each row is filled
with a random integer in the range [0, 1024].
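
For reference, I filled the table with something like this (a sketch; not
necessarily the exact command I ran, but equivalent):

    insert into t1 (col_int)
    select floor(random() * 1025)::int    -- uniform random integer in [0, 1024]
    from generate_series(1, 10001000);    -- one row per series element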

One strange thing I found is that:

test=> select distinct col_int from t1;
Time: 1258.627 ms
test=> select distinct col_int from t1;
Time: 1264.667 ms
test=> select distinct col_int from t1;
Time: 1261.805 ms

If I use "group by":

test=> select distinct col_int from t1 group by col_int;
Time: 1180.617 ms
test=> select distinct col_int from t1 group by col_int;
Time: 1179.849 ms
test=> select distinct col_int from t1 group by col_int;
Time: 1177.936 ms

So the performance difference is not very large.
But when I do that:

test=> select count(distinct col_int) from t1;
count
-------
1025
(1 row)

Time: 7367.476 ms
test=> select count(distinct col_int) from t1;
count
-------
1025
(1 row)

Time: 6946.233 ms
test=> select count(distinct col_int) from t1;
count
-------
1025
(1 row)

Time: 7386.969 ms
test=> select count(distinct col_int) from t1;
count
-------
1025
(1 row)

The speed is strictly worse! That doesn't make sense to me, since we could
just put the distinct values into a temporary table or a subquery and count
its rows; the performance should be almost the same.
So do you have any idea about this? Or maybe PostgreSQL's query planner
could be improved for this kind of query?
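
The temporary-table version I have in mind would be something like this (a
sketch; I haven't timed this one):

    create temporary table tmp_distinct as
        select distinct col_int from t1;    -- deduplicate once
    select count(*) from tmp_distinct;      -- then count the distinct values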

By the way, if I use a subquery to replace the above query, here's what I
got:

test=> select count(*) from (select distinct col_int from t1) as tmp;
count
-------
1025
(1 row)

Time: 1267.468 ms
test=> select count(*) from (select distinct col_int from t1) as tmp;
count
-------
1025
(1 row)

Time: 1257.327 ms
test=> select count(*) from (select distinct col_int from t1) as tmp;
count
-------
1025
(1 row)

Time: 1258.189 ms

OK, this workaround works, but I think it would be better if postgres could
handle this kind of query itself. Also, I'm not sure whether other similar
scenarios cause the same problem.
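
For instance (just a guess at a similar case, I haven't measured it),
counting distinct values of an expression would presumably hit the same
slow path and need the same rewrite:

    select count(distinct col_int % 10) from t1;    -- presumably slow, like count(distinct col_int)

    select count(*)
    from (select distinct col_int % 10 from t1) as tmp;    -- presumably fast, like the workaround above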

The following is the output of "explain analyze ...":
test=> explain analyze select distinct col_int from t1;
                                                        QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=169268.05..169278.30 rows=1025 width=4) (actual time=39034.653..39037.239 rows=1025 loops=1)
   ->  Seq Scan on t1  (cost=0.00..144265.04 rows=10001204 width=4) (actual time=0.041..19619.931 rows=10001000 loops=1)
 Total runtime: 39039.136 ms
(3 rows)

Time: 39103.622 ms
test=> explain analyze select distinct col_int from t1 group by col_int;
                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=169280.86..169291.11 rows=1025 width=4) (actual time=39062.417..39064.882 rows=1025 loops=1)
   ->  HashAggregate  (cost=169268.05..169278.30 rows=1025 width=4) (actual time=39058.136..39060.303 rows=1025 loops=1)
         ->  Seq Scan on t1  (cost=0.00..144265.04 rows=10001204 width=4) (actual time=0.024..19439.482 rows=10001000 loops=1)
 Total runtime: 39066.896 ms
(4 rows)

Time: 39067.198 ms
test=> explain analyze select count(distinct col_int) from t1;
                                                        QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=169268.05..169268.06 rows=1 width=4) (actual time=45994.120..45994.123 rows=1 loops=1)
   ->  Seq Scan on t1  (cost=0.00..144265.04 rows=10001204 width=4) (actual time=0.025..19599.950 rows=10001000 loops=1)
 Total runtime: 45994.154 ms
(3 rows)

Time: 45994.419 ms

test=> explain analyze select count(*) from (select distinct col_int from t1) as tmp;
                                                           QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=169291.11..169291.12 rows=1 width=0) (actual time=39050.598..39050.600 rows=1 loops=1)
   ->  HashAggregate  (cost=169268.05..169278.30 rows=1025 width=4) (actual time=39046.814..39048.742 rows=1025 loops=1)
         ->  Seq Scan on t1  (cost=0.00..144265.04 rows=10001204 width=4) (actual time=0.035..19616.631 rows=10001000 loops=1)
 Total runtime: 39050.634 ms
(4 rows)

Time: 39050.896 ms

P.S. I have already run "analyze verbose t1;" several times, so the planner
statistics should be up to date. But I'm just new to PostgreSQL.

The environment I use is:
PostgreSQL 9.3.1 (postgresql-9.3 9.3.1-1.pgdg12.4+1) on a local machine
(actually a VirtualBox VM, but I wasn't running anything heavy or unusual
during any of the queries, and the results seem consistent)
Ubuntu 12.04 LTS
I followed the installation steps in the Quickstart section of
http://wiki.postgresql.org/wiki/Apt

best regards,
jacket41142
