From: | rob stone <floriparob(at)gmail(dot)com> |
---|---|
To: | Michael Sheaver <msheaver(at)me(dot)com>, pgsql-general <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: "Fuzzy" Matches on Nicknames |
Date: | 2016-11-30 00:56:34 |
Message-ID: | 1480467394.4654.1.camel@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-general |
Hello Michael,
On Tue, 2016-11-29 at 19:10 -0500, Michael Sheaver wrote:
> Greetings,
>
> I have two tables that are populated using large datasets from
> disparate external systems, and I am trying to match records by
> customer name between these two tables. I do not have any
> authoritative key, such as customerID or nationalID, by which I can
> match them up, and I have found many cases where the same customer
> has different first names in the two datasets. A sampling of the
> differences is as follows:
>
> Michael <=> Mike
> Tom <=> Thomas
> Liz <=> Elizabeth
> Margaret <=> Maggie
>
> How can I build a query in PostgreSQL (v. 9.6) that will find
> possible matches like these on nicknames? My initial guess is that I
> would have to either find or build some sort of intermediary table
> that contains associated names like those above. Sometimes though,
> there will be more than matching pairs, like:
>
> Jim <=> James <=> Jimmy <=> Jimmie
> Bill <=> Will <=> Willie <=> William
>
> and so forth.
>
> Has anyone used or developed PostgreSQL queries that will find
> matches like these? I am running all my database queries. on my local
> laptops (Win7 and macOS), so performance or uptime is no issue here.
> I am curious to see how others in this community have creatively
> solved this common problem.
>
> One of the PostgreSQL dictionaries (synonym, thesaurus etc.) might
> work here, but honestly I am clueless as to how to set this up or use
> it in queries successfully.
>
> Thanks,
> Michael (aka Mike, aka Mikey)
>
Check out chapter F15 in the doco.
Try the double metaphone.
I worked on something similar many years ago cleaning up input created
by data entry clerks from hand written speeding tickets, so as to match
with "trusted" data held in a database.
As the volume of input was small in comparison with the number of
licensed drivers, we could iterate over and over again trying to match
it up.
HTH.
Rob
From | Date | Subject | |
---|---|---|---|
Next Message | John R Pierce | 2016-11-30 01:02:28 | Re: Monitoring Replication - Postgres 9.2 |
Previous Message | Michael Sheaver | 2016-11-30 00:10:59 | "Fuzzy" Matches on Nicknames |