Re: Removing duplicates

From: Giuseppe Broccolo <giuseppe(dot)broccolo(at)2ndquadrant(dot)it>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Removing duplicates
Date: 2013-07-10 15:50:07
Message-ID: 51DD82AF.4060006@2ndquadrant.it
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Dear Johann,

I tried (with PostgreSQL 9.2) to run the two DELETE statements you
describe in your mail (the first based on the "id" field, the second on
the ctid) and they work! I have to point out that if you use the DELETE
based on the "id" field YOU'LL DELETE ALL RECORDS having at least one
duplicate.
The "q1.id != q.id" doesn't work because query doesn't know yet its
result so every row with a duplicate will match without exception.
If you use the DELETE based on ctid order you delete only records with
the same "rart_id", keeping the record with the lowest "id".

Remember that if you run the first DELETE query you remove more data
than you expect.

Regards,

Giuseppe.

Il 10/07/2013 09:11, Johann Spies ha scritto:
> I have used this method successfully on another table but this one is not
> working.
>
> I have a table of nearly 800 million records with some duplicates in.
>
> Here is an example:
>
> select rart_id, r9, ra, ry, rw, rv, rp, rs, ri
> from isi.rcited_ref
> where rart_id = 'A1986D733500019';
>
> renders a result of 72 records. When I do
>
> select count(id), rart_id, r9, ra, ry, rw, rv, rp, rs, ri
> from isi.rcited_ref
> where rart_id = 'A1986D733500019'
> group by rart_id, r9, ra, ry, rw, rv, rp, rs, ri;
>
> It shows that there are 36 duplicates with this rart_id.
>
> So as a test I did the following (the id-field is the primary key):
>
> DELETE FROM isi.rcited_ref q
> WHERE EXISTS(SELECT 1 FROM isi.rcited_ref q1
> WHERE q1.id != q.id
> AND q.rart_id = q1.rart_id
> AND q.r9 = q1.r9
> AND q.ra = q1.ra
> AND q.ry = q1.ry
> AND q.rw = q1.rw
> AND q.rv = q1.rv
> AND q.rp = q1.rp
> AND q.rs = q1.rs
> AND q.rart_id = 'A1986D733500019'
> );
>
> But that deletes none. And I cannot see what went wrong.
>
> I have also tried the same query with ctid without success:
>
> DELETE FROM isi.rcited_ref q
> WHERE EXISTS(SELECT 1 FROM isi.rcited_ref q1
> WHERE q1.ctid < q.ctid
> AND q.rart_id = q1.rart_id
> AND q.r9 = q1.r9
> AND q.ra = q1.ra
> AND q.ry = q1.ry
> AND q.rw = q1.rw
> AND q.rv = q1.rv
> AND q.rp = q1.rp
> AND q.rs = q1.rs
> AND q.rart_id = 'A1986D733500019'
> );
>
>
> The size of the table makes it difficult to use a 'group by' method to
> delete all duplcates.
>
> What am I doing wrong?
>
> Regards
> Johann
> --
> Because experiencing your loyal love is better than life itself,
> my lips will praise you. (Psalm 63:3)

Browse pgsql-general by date

  From Date Subject
Next Message Jeff Janes 2013-07-10 16:58:24 Re: Force ssl connection
Previous Message dafNi zaf 2013-07-10 14:28:41 Re: dynamic partitioning