Re: Finding uniques across a big join

From: Scott Marlowe <smarlowe(at)g2switchworks(dot)com>
To: "John D(dot) Burger" <john(at)mitre(dot)org>
Cc: PostgreSQL general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Finding uniques across a big join
Date: 2005-11-30 22:42:26
Message-ID: 1133390546.16010.30.camel@state.g2switchworks.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Wed, 2005-11-30 at 16:27, John D. Burger wrote:
> Scott Marlowe wrote:
>
> > select
> > v1.pkey1,
> > v1.field2,
> > v1.field3,
> > v1.field4,
> > v2.pkey1,
> > v2.field2,
> > v2.field3,
> > v2.field4,
> > from
> > view v1
> > join
> > view v2
> > on (
> > v1.field2=v2.field2 and
> > v1.field3=v2.field3 and
> > v1.field3=v2.field3 and
> > v1.pkey1<>v2.pkey
> > )
> >
> > How does that work?
>
> Won't this be a massive cross product of all pkey pairs that have the
> same field values?
>

Yes, assuming there are a lot of them. OTOH, if there are only a few
duplicates you're looking for...

How many are you expecting, percentage wise, to get back?

> Here's what I'm currently using, in terms of your very helpful view:
>
> select v1.pkey1, v1.field2, v1.field3, v1.field4
> from view as v1
> join
> (select v2.field1, v2.field2, v2.field3
> from view as v2
> group by v2.field2, v2.field3, v2.field4
> having count(*) = 1)
> using (field2, field3, field4);
>
> This is the one that takes eight hours. :( Another way to express what
> I want is this:
>
> select v1.pkey1, v1.field2, v1.field3, v1.field4
> from view as v1
> where not exists
> (select true from view as v2
> where v1.field2 = v2.field2
> and v1.field3 = v2.field3
> and v1.field4 = v2.field4
> and v1.pkey1 <> v2.pkey1);
>
> That looks like a horrible nested loop, but I suppose I should try it
> to make sure it is indeed slower then the previous query.

If you can allocated enough shared memory for the set to fit in memory,
you might be able to get a hash agg method, which is much faster than
most other methods for this kind of thing, since it requires no sort.

In a side point, I'm currently mushing 888,000,000 6 character codes up
against each other to check for duplicates. I have 6 machines doing
this, at 1 million codes compared to 1 million codes every 0.5 seconds
aggregate. That gets me down to about 1 week. So, 8 hours is seeming
quite fast. :)

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tim Bunce 2005-11-30 23:35:55 Re: undefined behaviour for sub-transactions?
Previous Message Tyler MacDonald 2005-11-30 22:34:11 Re: undefined behaviour for sub-transactions?