Re: Need magic for identifieing double adresses

From: "Octavio Alvarez" <alvarezp(at)alvarezp(dot)ods(dot)org>
To: "Sam Mason" <sam(at)samason(dot)me(dot)uk>, Andreas <maps(dot)on(at)gmx(dot)net>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Need magic for identifieing double adresses
Date: 2010-09-23 15:38:38
Message-ID: op.vjhsd8qv4oyyg1@alvarezp-ws
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Thu, 16 Sep 2010 06:22:15 -0700, Andreas <maps(dot)on(at)gmx(dot)net> wrote:

> It's not only typos to catch. There is variation in the way to write
> things that not necessarily are wrong.
> e.g.
> Miller's Bakery
> Bakery Miller
> Bakery Miller, Ltd.
> Bakery Miller and sons
> Bakery Smith (formerly Miller)
>
> and the usual
> Strawberry Street
> Strawberrystreet
> Strawberry Str.42
> Strawberry Str. 42
> Strawberry Str. 42-45

If this is a one-time procedure, I'd definitely go manually. The key is to
quickly bind records and find the "remaining" ones.

I'd create a lookup table and bind all similar values to a single value.

I would also take each word in the field, turn it to lower case, remove
punctuation signs and enter it in another table (original_word varchar,
normalized_word varchar). I would then search for the most popular
normalized_word, hoping that would throw me back keywords like
"strawberry" and "miller". I would then search for those to continue
creating the look up table.

You might want to write an interface to let you drag all the DISTINCT
keywords and drop them to the "single" value.

I have never seen it, though. :)

Good luck.

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Bruce Momjian 2010-09-23 16:23:34 Re: Unable to upgrade old cluster from 8.4 to 9.0
Previous Message Thom Brown 2010-09-23 15:31:56 Re: Unable to upgrade old cluster from 8.4 to 9.0