Quick Links

Re: Query performance question on a large table

From:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To:	Sean Shanny <shannyconsulting(at)earthlink(dot)net>
Cc:	"Pgsql-General(at)Postgresql(dot)Org (E-mail)" <pgsql-general(at)postgresql(dot)org>
Subject:	Re: Query performance question on a large table
Date:	2004-01-07 07:31:22
Message-ID:	16879.1073460682@sss.pgh.pa.us
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

Sean Shanny <shannyconsulting(at)earthlink(dot)net> writes:
> Sort of piggybacking on this thread but why the suggestion to drop the
> use of DISTINCT in 7.4?

Because the 7.4 planner can decide for itself whether DISTINCT'ifying
the sub-select output is the best way to proceed or not. There is more
than one good way to do an "IN sub-SELECT" operation, and the 7.4
planner knows several. (Pre-7.4 planners didn't know any :-( ... but
I digress.) When you write "foo IN (SELECT DISTINCT ...)", the DISTINCT
doesn't change the semantics at all, it just adds overhead.

In fact it's worse than that: if the planner decides that the best way
to proceed is to make the subselect output unique, it will throw another
layer of sort/unique processing on top of what you did. So writing
DISTINCT is actually a pessimization in 7.4.

> Example:

> -> HashAggregate (cost=1020025.13..1020025.13 rows=51 width=4)
> (actual time=954049.317..954450.065 rows=82208 loops=1)
> -> Subquery Scan "IN_subquery" (cost=983429.20..1020025.00
> rows=51 width=4) (actual time=856641.244..953639.116 rows=82208 loops=1)
> -> Unique (cost=983429.20..1020024.49 rows=51 width=4)
> (actual time=856641.230..952939.539 rows=82208 loops=1)
> -> Sort (cost=983429.20..1001726.84 rows=7319058
> width=4) (actual time=856641.215..906429.835 rows=11067735 loops=1)
> Sort Key: useragent_key
> -> Index Scan using

The sort/unique steps are coming from the DISTINCT. The HashAggregate
step is the planner making sure the output rows are distinct :-(

I just a couple days ago added some logic to CVS tip to notice that the
sub-select has a DISTINCT clause, and not add unnecessary unique-ifying
processing on top of it. So in 7.5, writing a DISTINCT clause will
amount to forcing a particular query plan, which might or might not be
the best thing but hopefully won't be too terrible. But in 7.4 it has
nothing to recommend it ...

regards, tom lane

In response to

Re: Query performance question on a large table at 2004-01-07 07:10:42 from Sean Shanny

Responses

Re: Query performance question on a large table at 2004-01-07 17:43:07 from Bruno Wolff III

Browse pgsql-general by date

	From	Date	Subject
Next Message	Chris Travers	2004-01-07 07:32:22	Re: Paypal WAS: PostgreSQL speakers needed for OSCON
Previous Message	Keith C. Perry	2004-01-07 07:14:51	Re: Paypal WAS: PostgreSQL speakers needed for OSCON