Re: "Fuzzy" Matches on Nicknames

From: rob stone <floriparob(at)gmail(dot)com>
To: Michael Sheaver <msheaver(at)me(dot)com>, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: "Fuzzy" Matches on Nicknames
Date: 2016-11-30 00:56:34
Message-ID: 1480467394.4654.1.camel@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello Michael,
On Tue, 2016-11-29 at 19:10 -0500, Michael Sheaver wrote:
> Greetings,
>
> I have two tables that are populated using large datasets from
> disparate external systems, and I am trying to match records by
> customer name between these two tables. I do not have any
> authoritative key, such as customerID or nationalID, by which I can
> match them up, and I have found many cases where the same customer
> has different first names in the two datasets. A sampling of the
> differences is as follows:
>
> Michael <=> Mike
> Tom <=> Thomas
> Liz <=> Elizabeth
> Margaret <=> Maggie
>
> How can I build a query in PostgreSQL (v. 9.6) that will find
> possible matches like these on nicknames? My initial guess is that I
> would have to either find or build some sort of intermediary table
> that contains associated names like those above. Sometimes though,
> there will be more than matching pairs, like:
>
> Jim <=> James <=> Jimmy <=> Jimmie
> Bill <=> Will <=> Willie <=> William
>
> and so forth.
>
> Has anyone used or developed PostgreSQL queries that will find
> matches like these? I am running all my database queries. on my local
> laptops (Win7 and macOS), so performance or uptime is no issue here.
> I am curious to see how others in this community have creatively
> solved this common problem.
>
> One of the PostgreSQL dictionaries (synonym, thesaurus etc.) might
> work here, but honestly I am clueless as to how to set this up or use
> it in queries successfully.
>
> Thanks,
> Michael (aka Mike, aka Mikey)
>

Check out chapter F15 in the doco.
Try the double metaphone.
I worked on something similar many years ago cleaning up input created
by data entry clerks from hand written speeding tickets, so as to match
with "trusted" data held in a database.
As the volume of input was small in comparison with the number of
licensed drivers, we could iterate over and over again trying to match
it up.

HTH.
Rob

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message John R Pierce 2016-11-30 01:02:28 Re: Monitoring Replication - Postgres 9.2
Previous Message Michael Sheaver 2016-11-30 00:10:59 "Fuzzy" Matches on Nicknames