Fwd: Slow Count-Distinct Query

From: Varadharajan Mukundan <srinathsmn(at)gmail(dot)com>
To: pgsql-performance(at)postgresql(dot)org
Subject: Fwd: Slow Count-Distinct Query
Date: 2014-04-04 09:31:58
Message-ID: CACKkDGE-6LJF-vJdmaPiGghLjwcQXDF1cAsnZtQhF+wA9KVtFQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Sorry that i just joined the list and have to break the thread to reply to
Tom Lane's response on this @
http://www.postgresql.org/message-id/13741.1396275339@sss.pgh.pa.us

Note that the indexscan is actually *slower* than the seqscan so far as
> the table access is concerned; if the table were big enough to not fit
> in RAM, this would get very much worse. So I'm not impressed with trying
> to force the optimizer's hand as you've done here --- it might be a nice
> plan now, but it's brittle. See if a bigger work_mem improves matters
> enough with the regular plan.

I agree to the point that hand tuning optimiser is brittle and something
that should not be done. But the reason to that was to force the index-only
scan (Not the index scan). I feel Index-only scan would speed up given
postgres is row oriented and we are running count-distinct on a column in a
table with lot of columns (Say 6-7 in number). I think that is what have
contributed to the gain in performance.

I did a similar test with around 2 million tuples with work_mem = 250 MB
and got the query to respond with 2x speed up. But the speed-up got with
index-only scan was huge and response was in sub-seconds whereas with
work_mem the response was couple of seconds.

> I doubt you need the "where email=email" hack, in any case. That isn't
> forcing the optimizer's decision in any meaningful fashion.

This is the where clause which forced the index-only scan on partial index.
AFAIK, whenever the the tuples estimated to be processed by query is going
to be high, Postgres does a seq scan. This was happening even in the
following query even though an index was present in the email column and
index-only could have had much faster processing. I think the planner is
taking a hit on estimating the cost of the query with index-only scan. So
the little trick we had put in was to create a partial index on email field
of the participants table and use the same condition of the partial index
in the query to trick the optimiser to use index-only scans.

Query : select count(distinct email) from participants;

Please revert me back if i'm missing something.

--
Thanks,
M. Varadharajan

------------------------------------------------

"Experience is what you get when you didn't get what you wanted"
-By Prof. Randy Pausch in "The Last Lecture"

My Journal :- www.thinkasgeek.wordpress.com

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Hans Drexler 2014-04-04 12:00:31 Batch update query performance
Previous Message Ilya Kosmodemiansky 2014-04-04 06:11:49 Re: PGSQL, checkpoints, and file system syncs