Quick Links

Re: Need magic for identifieing double adresses

From:	Sam Mason <sam(at)samason(dot)me(dot)uk>
To:	pgsql-general(at)postgresql(dot)org
Subject:	Re: Need magic for identifieing double adresses
Date:	2010-09-16 13:39:11
Message-ID:	20100916133911.GC7862@samason.me.uk
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

On Thu, Sep 16, 2010 at 03:22:15PM +0200, Andreas wrote:
> We are talking about nearly 500.000 records with considerable overlapping.

Other things to consider is whether each one contains unique entries and
hence can you do a "best match" between datasets--FULL OUTER JOIN is
your friend here, but duplicates become a problem.

> It's not only typos to catch. There is variation in the way to write
> things that not necessarily are wrong.
> e.g.
> Miller's Bakery
> Bakery Miller
> Bakery Miller, Ltd.
> Bakery Miller and sons
> Bakery Smith (formerly Miller)

Soundex is tolerant to quite a lot of this, but word order is important.
When I've had to do this before ~360k merging with ~80k addresses I've
gone with normalised postcodes (in the UK postcodes contain a nice mix
of letters and numbers meaning that I can be reasonable sure about
typos) and then gone through a reasonable chunk by hand to make sure
things are working "correctly".

Just thought; depending on your spacial sparsity, you may be able
to get away with trusting the zip code and checking when the soundex of
the name is different.

> and the usual
> Strawberry Street
> Strawberrystreet
> Strawberry Str.42
> Strawberry Str. 42
> Strawberry Str. 42-45

Soundex gets those all the same (and even '42-45 Strawberry Str'), so
that's easy. In fact it completely ignores the numbers so you'll have
to do something specific about them.

--
Sam http://samason.me.uk/

In response to

Re: Need magic for identifieing double adresses at 2010-09-16 13:22:15 from Andreas

Browse pgsql-general by date

	From	Date	Subject
Next Message	Tom Lane	2010-09-16 13:49:20	Re: libssl issue ?
Previous Message	Merlin Moncure	2010-09-16 13:35:44	Re: Using libpq, floats and binary data