From: | John DeSoi <desoi(at)pgedit(dot)com> |
---|---|
To: | Andreas <maps(dot)on(at)gmx(dot)net> |
Cc: | pgsql-general(at)postgresql(dot)org |
Subject: | Re: Need magic for identifieing double adresses |
Date: | 2010-09-17 20:29:04 |
Message-ID: | CFB262B9-6831-49EA-938C-CBB1B3B36A8D@pgedit.com |
Lists: | pgsql-general |
On Sep 15, 2010, at 10:40 PM, Andreas wrote:
> I need to clean up a lot of contact data because of a merge of customer lists that used to be kept separate.
> I already know that there are duplicate entries within the lists and they do overlap, too.
>
> Relevant fields could be name, street, zip, city, phone
>
> Is there a way to do something like this with postgresql ?
>
> I fear this will still need a lot of manual sorting and searching even when potential peers get automatically identified.
I recently started working with the pg_trgm contrib module for matching songs based on titles and writers. This is especially difficult because the writer credits end up in one big field with every possible variation on order and naming conventions. So far I have been pleased with the results. For example, the algorithm correctly matched these two song titles:
FONTAINE DI ROMA AKA FOUNTAINS OF ROME
FOUNTAINS OF ROME A/K/A FONTANE DI ROMA
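As a rough illustration of why these two titles score as similar, here is a minimal Python sketch of trigram similarity. It is not pg_trgm itself. It assumes the commonly documented behaviour of the module (lowercase the text, split on non-alphanumeric characters, pad each word with two leading blanks and one trailing blank, then compare the sets of three-character substrings), and the function names `trigrams` and `similarity` are chosen here for illustration only.

```python
import re


def trigrams(text):
    """Return the set of trigrams for text, approximating pg_trgm's rules."""
    grams = set()
    # Lowercase, then treat every run of ASCII letters/digits as one word.
    # (Assumption: ASCII-only input, which is enough for this example.)
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # Pad with two leading blanks and one trailing blank.
        padded = "  " + word + " "
        for i in range(len(padded) - 2):
            grams.add(padded[i:i + 3])
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


title_a = "FONTAINE DI ROMA AKA FOUNTAINS OF ROME"
title_b = "FOUNTAINS OF ROME A/K/A FONTANE DI ROMA"
print(similarity(title_a, title_b))  # prints 0.75
```

Word order does not matter because only the set of trigrams is compared, and the spelling variation (FONTAINE vs. FONTANE) and the AKA vs. A/K/A difference change only a handful of trigrams. The sketch's score of 0.75 is well above 0.3, the default similarity threshold in pg_trgm; the real module may score these strings somewhat differently.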
Trigrams can be indexed, so it is relatively fast to find an initial set of candidates.
There is a nice introductory article here:
http://www.postgresonline.com/journal/categories/59-pgtrgm
John DeSoi, Ph.D.