Slow Count-Distinct Query

From: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
To: Varadharajan Mukundan <srinathsmn(at)gmail(dot)com>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Slow Count-Distinct Query
Date: 2014-04-06 17:26:18
Message-ID: CAMkU=1zRyDEOoQPEiqj5mdr0wz-w-wyc_YpaxEndWa0BnOJxRA@mail.gmail.com
Lists: pgsql-performance

On Friday, April 4, 2014, Varadharajan Mukundan <srinathsmn(at)gmail(dot)com>
wrote:

> Hi Jeff,
>
>> It looks like the original emailer wrote a query that the planner is not
>> smart enough to plan properly (a known limitation of that kind of query).
>> He then made a bunch of changes, none of which worked. He then re-wrote
>> the query into a form that the planner does a better job on. What we do
>> not know is what would happen if he undoes all of those other changes
>> and *just* uses the new form of the query?
>>
>
>
> I was also pairing up with Chris (the original emailer) on this issue, and
> in order to reproduce it I've created a two-column table with the following
> schema and 1.9 million dummy rows:
>
> =================================
>
> a=# \d participants
>              Table "public.participants"
>  Column |          Type          |                         Modifiers
> --------+------------------------+------------------------------------------------------------
>  id     | integer                | not null default nextval('participants_id_seq'::regclass)
>  email  | character varying(255) |
>
> I've tried out various scenarios on this table and recorded them as a
> transcript below (please read it top to bottom, like a shell script, to
> get the whole picture):
>
> *; Create table and Insert 1.9 Million rows*
>
> a=# create table participants(id serial, email varchar(255));
> NOTICE: CREATE TABLE will create implicit sequence "participants_id_seq"
> for serial column "participants.id"
> CREATE TABLE
> a=# \d participants
>              Table "public.participants"
>  Column |          Type          |                         Modifiers
> --------+------------------------+------------------------------------------------------------
>  id     | integer                | not null default nextval('participants_id_seq'::regclass)
>  email  | character varying(255) |
>
> a=# copy participants from '/tmp/a.csv' with csv;
> COPY 1999935
>

Thanks for the detailed response. I don't have access to your /tmp/a.csv,
of course, so I just used this:

insert into participants (email) select md5(floor(random()*1000000)::text)
from generate_series(1,2000000);

This gives each email showing up about twice.
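
For anyone reproducing this, a quick way to check how much duplication the
generated data actually has (just a sanity-check query, not something from
the original thread) is:

select count(*) as total_rows, count(distinct email) as distinct_emails
from participants;

The ratio of the two columns is the average number of times each email
appears.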

>
> a=# EXPLAIN (ANALYZE) select count(1) from (select email from participants
> where email=email group by email) x;
>                                                             QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=37874.94..37874.96 rows=1 width=0) (actual time=1549.258..1549.258 rows=1 loops=1)
>    ->  HashAggregate  (cost=37763.19..37812.86 rows=4967 width=16) (actual time=1168.114..1461.672 rows=1000000 loops=1)
>          ->  Seq Scan on participants  (cost=0.00..37738.19 rows=10000 width=16) (actual time=0.045..411.267 rows=1999935 loops=1)
>                Filter: ((email)::text = (email)::text)
>  Total runtime: 1567.586 ms
> (5 rows)
>

What you have done here is trick the planner into thinking your query will
be 200 times smaller than it actually is, and thus that the hash table will
be 200 times smaller than it actually is and will therefore fit in the
allowed memory. This is effective at getting the more efficient hash
aggregate, but it is no more safe than just giving the query explicit
permission to use that much memory by increasing work_mem 200-fold.

I am kind of surprised that the planner is so easily fooled by that.
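
If the goal is simply to let the hash aggregate run, the honest version of
the same thing is to raise work_mem for the session and use the plain query,
something like the following (the value here is purely illustrative; size it
to the real number of distinct emails):

set work_mem = '256MB';  -- illustrative value, not a recommendation
explain (analyze) select count(1) from (select email from participants group by email) x;
reset work_mem;

At least then the extra memory use is stated explicitly rather than hidden
behind a fooled row estimate.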

>
>
> *; Creation of idx on email field*
>
> a=# create index email_idx on participants(email);
> CREATE INDEX
>
>

> a=# EXPLAIN (ANALYZE) select count(1) from (select email from
> participants group by email) x;
>                                                              QUERY PLAN
> -------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=48622.59..48622.60 rows=1 width=0) (actual time=1242.718..1242.718 rows=1 loops=1)
>    ->  HashAggregate  (cost=26273.09..36206.20 rows=993311 width=16) (actual time=855.215..1150.781 rows=1000000 loops=1)
>          ->  Seq Scan on participants  (cost=0.00..21273.25 rows=1999935 width=16) (actual time=0.058..217.105 rows=1999935 loops=1)
>  Total runtime: 1264.234 ms
> (4 rows)
>

I can't reproduce this at all, except by increasing work_mem. The hash
table needed here is no smaller than the one needed before the index was
built. Did you increase work_mem before running this?

Instead what I get is the index only scan (to provide order) feeding into a
Group.
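
It would help to see what the session was actually running with, for
example:

show work_mem;
select name, setting, source from pg_settings where source <> 'default';

A non-default work_mem there would explain getting the hash aggregate
without the row-estimate trick.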

>
> a=# drop index email_idx;
> DROP INDEX
>
> *; Creation of partial index on email *
>
> a=# create index email_idx on participants(email) where email=email;
> CREATE INDEX
>
> a=# EXPLAIN (ANALYZE) select count(distinct email) from participants
> where email=email;
>                                                                 QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=12949.26..12949.27 rows=1 width=16) (actual time=3465.472..3465.473 rows=1 loops=1)
>    ->  Bitmap Heap Scan on participants  (cost=249.14..12924.26 rows=10000 width=16) (actual time=161.776..421.223 rows=1999935 loops=1)
>          Recheck Cond: ((email)::text = (email)::text)
>          ->  Bitmap Index Scan on email_idx  (cost=0.00..246.64 rows=10000 width=0) (actual time=159.446..159.446 rows=1999935 loops=1)
>  Total runtime: 3466.867 ms
> (5 rows)
>

I also cannot get this.

>
> a=# set enable_bitmapscan = false;
> a=# set seq_page_cost = 0.1;
> SET
> a=# set random_page_cost = 0.2;
> SET
>

These don't seem to accomplish anything for you. They switch the slow form
of the query between two plans with about the same speed.

>
> a=# explain analyze select count(distinct email) from participants where
> email=email;
>                                                                      QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=1517.16..1517.17 rows=1 width=16) (actual time=3504.310..3504.310 rows=1 loops=1)
>    ->  Index Only Scan using email_idx on participants  (cost=0.00..1492.16 rows=10000 width=16) (actual time=0.101..795.595 rows=1999935 loops=1)
>          Heap Fetches: 1999935
>  Total runtime: 3504.358 ms
> (4 rows)
>
> a=# explain analyze select count(1) from (select email from participants
> where email=email group by email) x;
>
>                                                                        QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=1579.25..1579.26 rows=1 width=0) (actual time=1193.912..1193.912 rows=1 loops=1)
>    ->  Group  (cost=0.00..1517.16 rows=4967 width=16) (actual time=0.096..1101.828 rows=1000000 loops=1)
>          ->  Index Only Scan using email_idx on participants  (cost=0.00..1492.16 rows=10000 width=16) (actual time=0.091..719.281 rows=1999935 loops=1)
>                Heap Fetches: 1999935
>  Total runtime: 1193.963 ms
> (5 rows)
>
> *; Oh yes, cluster the rows by email*
>
> a=# create index email_idx_2 on participants(email)
> a-# ;
> CREATE INDEX
>
> a=# cluster participants using email_idx_2;
> CLUSTER
>
>
> a=# EXPLAIN (ANALYZE) select count(1) from (select email from participants
> where email=email group by email) x;
>
>                                                                        QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=3376.63..3376.64 rows=1 width=0) (actual time=942.118..942.119 rows=1 loops=1)
>    ->  Group  (cost=0.00..3314.54 rows=4967 width=16) (actual time=0.080..851.315 rows=1000000 loops=1)
>          ->  Index Only Scan using email_idx on participants  (cost=0.00..3289.54 rows=10000 width=16) (actual time=0.076..472.101 rows=1999935 loops=1)
>                Heap Fetches: 1999935
>  Total runtime: 942.159 ms
> (5 rows)
>

Once I cluster, vacuum, and analyze the table, I get this plan without the
partial index (just the regular index), without the "where email=email",
and without disabling the bitmap scan or changing the page_cost parameters.

I usually get this plan without the cluster, too. I think it depends on the
luck of the sampling in the autoanalyze.
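
In other words, on the synthetic data above, a sequence roughly like this
(the index name is just illustrative) is enough to get the Group over
Index Only Scan plan, with no partial index, no "where email=email", and
no cost-parameter changes:

create index participants_email_idx on participants (email);
cluster participants using participants_email_idx;
vacuum analyze participants;
explain (analyze) select count(1) from (select email from participants group by email) x;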

Cheers,

Jeff

>
