De-duplicating rows

From: Christophe <xof(at)thebuild(dot)com>
To: pgsql-sql(at)postgresql(dot)org
Subject: De-duplicating rows
Date: 2009-07-17 03:07:09
Message-ID: 87E8E647-4CAD-4491-B687-FA763E0EBE7E@thebuild.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-sql

The Subject: is somewhat imprecise, but here's what I'm trying to do.
For some reason, my brain is locking up over it.

I'm moving a 7.2 (yes) database to 8.4. In the table in question, the
structure is along the lines of:

serial_number SERIAL, PRIMARY KEY
email TEXT
create_date TIMESTAMP
attr1 type
attr2 type
attr3 type
...

(The point of the "attr" fields is that there are many more columns
for each row.)

The new structure removes the "serial_number" field, and uses "email"
as the primary key, but is otherwise unchanged:

email TEXT, PRIMARY KEY
create_date TIMESTAMP
attr1 type
attr2 type
attr3 type
...

Now, since this database has been production since 7.2 days, cruft has
crept in: in particular, there are duplicate email addresses, some
with mismatched attributes. The policy decision by the client is that
the correct row is the one with the earliest timestamp. (The
timestamps are widely distributed; it's not the case that there is a
single timestamp above which all the duplicates live.) Thus, ideally,
I want to select exactly one row per "email", picking the row with the
earliest timestamp in the case that there is more than one row with
that email.

Any suggestions on how to write such a SELECT? Of course, I could do
this with an application against the db, but a single SELECT would be
great if possible.

TIA!

Responses

Browse pgsql-sql by date

  From Date Subject
Next Message Richard Huxton 2009-07-17 07:21:01 Re: De-duplicating rows
Previous Message Tim Landscheidt 2009-07-16 21:19:56 Comparing a string against an XPath result set