Detecting duplicates in messy data

From: Tim Uckun <timuckun(at)gmail(dot)com>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Detecting duplicates in messy data
Date: 2011-06-06 10:48:30
Message-ID: BANLkTi=1RaWKK1bVThjXyeFWcx8+dOR-5A@mail.gmail.com
Lists: pgsql-general

I have a couple of tables (people and addresses) which use serials as
primary keys and contain a lot of potentially duplicate data. The
problem is that the data has not been entered carefully. For example,
there are first_name, middle_name and last_name fields, but the
database could contain Samuel L Jackson, Samuel Jackson, Sam Jackson
and even Jackson L Samuel (data in the wrong fields), all representing
the same person.
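
To make the problem concrete, here is a minimal sketch of one possible
way to compare such records. It is written in Python purely for
illustration and is not a tested solution for the real tables. The
helper names (name_tokens, name_similarity) are hypothetical, and
difflib's SequenceMatcher is just a stand-in for whatever fuzzy
matching function is eventually used. The idea is to ignore which field
a name part was typed into, then score the remaining difference.

```python
from difflib import SequenceMatcher


def name_tokens(first, middle, last):
    """Return the lowercased name parts, sorted so field order is ignored."""
    joined = " ".join(part for part in (first, middle, last) if part)
    return sorted(joined.lower().split())


def name_similarity(a, b):
    """Similarity in [0, 1] between two (first, middle, last) tuples."""
    key_a = " ".join(name_tokens(*a))
    key_b = " ".join(name_tokens(*b))
    return SequenceMatcher(None, key_a, key_b).ratio()


# The example names from above, as (first_name, middle_name, last_name).
people = [
    ("Samuel", "L", "Jackson"),
    ("Samuel", None, "Jackson"),
    ("Sam", None, "Jackson"),
    ("Jackson", "L", "Samuel"),  # data in the wrong fields
]

reference = people[0]
scores = [name_similarity(reference, person) for person in people]

for person, score in zip(people, scores):
    print(person, round(score, 2))
```

With this approach the record with swapped fields scores exactly 1.0
against the reference, while the shortened variants score lower but
still high. Where to put the cut-off for "probably the same person" is
a judgment call that would need checking against real data.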

I have been thinking of some algorithms that might work to identify
the duplicate records, but I am no mathematician, so I thought I would
ask here before wasting a lot of time trying to solve a problem that
has already been solved. Postgres has lots of great functionality in
the fuzzystrmatch module, so I am sure it can excel at this kind of
thing.

Any ideas or links to documents would be much appreciated.

Cheers.
