Re: De-duplicating rows

From: Frank Bax <fbax(at)sympatico(dot)ca>
To: pgsql-sql(at)postgresql(dot)org
Subject: Re: De-duplicating rows
Date: 2009-07-17 15:04:48
Message-ID: 4A609310.4050003@sympatico.ca
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

Richard Huxton wrote:
> Christophe wrote:
>> Now, since this database has been production since 7.2 days, cruft has
>> crept in: in particular, there are duplicate email addresses, some
>> with mismatched attributes. The policy decision by the client is that
>> the correct row is the one with the earliest timestamp.
>
> Something like (untested):
>
> CREATE TEMPORARY TABLE earliest_duplicates AS
> SELECT
> email AS tgt_email,
> min(create_date) AS tgt_date
> FROM mytable
> GROUP BY email
> HAVING count(*) > 1;
>
> DELETE FROM mytable USING earliest duplicates
> WHERE email=tgt_email AND create_date > tgt_date;

If it is possible that two rows exist for the same email/date; then you
will likely need to deal with these manually.

If you rerun the above SELECT after running the delete you should
identify these rows.

In response to

Browse pgsql-sql by date

  From Date Subject
Next Message Scott Marlowe 2009-07-17 15:32:55 Re: De-duplicating rows
Previous Message Richard Huxton 2009-07-17 07:21:01 Re: De-duplicating rows