Quick Links

Re: Simple machine-killing query!

From:	Stephan Szabo <sszabo(at)megazone(dot)bigpanda(dot)com>
To:	Victor Ciurus <vikcious(at)gmail(dot)com>
Cc:	pgsql-performance(at)postgresql(dot)org
Subject:	Re: Simple machine-killing query!
Date:	2004-10-21 15:05:59
Message-ID:	20041021075757.A54616@megazone.bigpanda.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

On Thu, 21 Oct 2004, Victor Ciurus wrote:

> Hi all,
>
> I'm writing this because I've reached the limit of my imagination and
> patience! So here is it...
>
> 2 tables:
> 1 containing 27 million variable lenght, alpha-numeric records
> (strings) in 1 (one) field. (10 - 145 char lenght per record)
> 1 containing 2.5 million variable lenght, alpha-numeric records
> (strings) in 1 (one) field.
>
> table wehere created using:
> CREATE TABLE "public"."BIGMA" ("string" VARCHAR(255) NOT NULL) WITH OIDS; +
> CREATE INDEX "BIGMA_INDEX" ON "public"."BIGMA" USING btree ("string");
> and
> CREATE TABLE "public"."DIRTY" ("string" VARCHAR(128) NOT NULL) WITH OIDS; +
> CREATE INDEX "DIRTY_INDEX" ON "public"."DIRTY" USING btree ("string");
>
> What I am requested to do is to keep all records from 'BIGMA' that do
> not apear in 'DIRTY'
> So far I have tried solving this by going for:
>
> [explain] select * from BIGMA where string not in (select * from DIRTY);
> QUERY PLAN
> ------------------------------------------------------------------------
> Seq Scan on bigma (cost=0.00..24582291.25 rows=500 width=145)
> Filter: (NOT (subplan))
> SubPlan
> -> Seq Scan on dirty (cost=0.00..42904.63 rows=2503963 width=82)
> (4 rows)

Have you analyzed bigma? The number of rows from the two explains for that
table look suspiciously like default values.

Also, what version are you using, because there are some differences from
7.3 to 7.4 that change possible suggestions.

The first is that on 7.4, you may be able to do better with a higher
sort_mem which could possible switch over to the hashed implementation,
although I think it's probably going to take a pretty high value given the
size.

The second is that you might get better results (even on older versions)
from an exists or left join solution, something like (assuming no nulls in
bigma.email):

select * from bigma where not exists(select 1 from dirty where dirty.email
!= bigma.email);

select bigma.* from bigma left outer join dirty on (dirty.email =
bigma.email) where dirty.email is null;

If you've got nulls in bigma.email you have to be a little more careful.

> [explain] select * from bigma,dirty where bigma.email!=dirty.email;

This *almost* certainly does not do what you want. For most data sets
this is going to give you a number of rows very close to # of rows in
dirty * # of rows in bigma. Needless to say, this is going to take a long
time.

In response to

Simple machine-killing query! at 2004-10-21 14:34:17 from Victor Ciurus

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Aaron Werman	2004-10-21 15:14:14	Re: Simple machine-killing query!
Previous Message	Jan Dittmer	2004-10-21 15:02:00	Re: Anything to be gained from a 'Postgres Filesystem'?