From: Scott Marlowe <smarlowe(at)g2switchworks(dot)com>
To: "John D(dot) Burger" <john(at)mitre(dot)org>
Cc: PostgreSQL general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Finding uniques across a big join
Date: 2005-11-30 22:42:26
Message-ID: 1133390546.16010.30.camel@state.g2switchworks.com
Lists: pgsql-general
On Wed, 2005-11-30 at 16:27, John D. Burger wrote:
> Scott Marlowe wrote:
>
> > select
> > v1.pkey1,
> > v1.field2,
> > v1.field3,
> > v1.field4,
> > v2.pkey1,
> > v2.field2,
> > v2.field3,
> > v2.field4
> > from
> > view v1
> > join
> > view v2
> > on (
> > v1.field2=v2.field2 and
> > v1.field3=v2.field3 and
> > v1.field4=v2.field4 and
> > v1.pkey1<>v2.pkey1
> > )
> >
> > How does that work?
>
> Won't this be a massive cross product of all pkey pairs that have the
> same field values?
>
Yes, assuming there are a lot of them. OTOH, if there are only a few
duplicates you're looking for...
How many are you expecting, percentage-wise, to get back?
> Here's what I'm currently using, in terms of your very helpful view:
>
> select v1.pkey1, v1.field2, v1.field3, v1.field4
> from view as v1
> join
> (select v2.field2, v2.field3, v2.field4
> from view as v2
> group by v2.field2, v2.field3, v2.field4
> having count(*) = 1) as uniq
> using (field2, field3, field4);
>
> This is the one that takes eight hours. :( Another way to express what
> I want is this:
>
> select v1.pkey1, v1.field2, v1.field3, v1.field4
> from view as v1
> where not exists
> (select true from view as v2
> where v1.field2 = v2.field2
> and v1.field3 = v2.field3
> and v1.field4 = v2.field4
> and v1.pkey1 <> v2.pkey1);
>
> That looks like a horrible nested loop, but I suppose I should try it
> to make sure it is indeed slower than the previous query.
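To make the two formulations concrete, here's a small self-contained sketch of both queries run against a toy in-memory SQLite table standing in for the view. The table and column names just mirror the thread (pkey1, field2..field4); the data is made up for illustration, and SQLite's planner behaves nothing like Postgres' on real volumes.

```python
# Both "find unique rows" formulations from the thread, against toy data.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE v (pkey1 INTEGER PRIMARY KEY,
                    field2 TEXT, field3 TEXT, field4 TEXT);
    INSERT INTO v VALUES
        (1, 'a', 'b', 'c'),   -- duplicated by pkey1 = 2
        (2, 'a', 'b', 'c'),
        (3, 'x', 'y', 'z');   -- unique combination
""")

# Formulation 1: join against a GROUP BY ... HAVING count(*) = 1 subquery.
group_by = conn.execute("""
    SELECT v1.pkey1
    FROM v AS v1
    JOIN (SELECT field2, field3, field4
          FROM v
          GROUP BY field2, field3, field4
          HAVING count(*) = 1) AS uniq
    USING (field2, field3, field4)
""").fetchall()

# Formulation 2: NOT EXISTS a different row with the same field values.
not_exists = conn.execute("""
    SELECT v1.pkey1
    FROM v AS v1
    WHERE NOT EXISTS
        (SELECT 1 FROM v AS v2
         WHERE v1.field2 = v2.field2
           AND v1.field3 = v2.field3
           AND v1.field4 = v2.field4
           AND v1.pkey1 <> v2.pkey1)
""").fetchall()

print(group_by)    # only row 3 has a unique (field2, field3, field4)
print(not_exists)  # same answer, expressed as an anti-join
```

Both queries ask the same question; the interesting part is which plan the optimizer picks for each on the real data.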
If you can allocate enough working memory for the set to fit, you might
be able to get a hash aggregate plan, which is much faster than most
other methods for this kind of thing, since it requires no sort.
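For what it's worth, the per-backend knob that usually governs this in PostgreSQL is work_mem (called sort_mem on releases before 8.0); the value below is only a placeholder, not a recommendation, and "view" stands in for the real view name:

```sql
-- Hypothetical session setting; pick a value your machine can afford.
SET work_mem = 524288;  -- in kilobytes on 8.0/8.1-era releases

EXPLAIN
SELECT field2, field3, field4
FROM view
GROUP BY field2, field3, field4
HAVING count(*) = 1;
-- Look for "HashAggregate" rather than a sort-based "GroupAggregate"
-- in the resulting plan.
```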
On a side note, I'm currently mushing 888,000,000 6-character codes up
against each other to check for duplicates. I have 6 machines doing
this, at 1 million codes compared to 1 million codes every 0.5 seconds
aggregate. That gets me down to about 1 week. So, 8 hours is seeming
quite fast. :)