From: | Andrew Perrin <andrew_perrin(at)unc(dot)edu> |
---|---|
To: | Matthew Hagerty <matthew(at)brwholesale(dot)com> |
Cc: | pgsql-sql(at)postgresql(dot)org |
Subject: | Re: Removing duplicates |
Date: | 2002-02-26 15:44:53 |
Message-ID: | Pine.LNX.4.21L1.0202261041350.7939-100000@nujoma.perrins |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-sql |
Zip code and then address seems reasonable to me. If you want to go
all-out, you could write a script that checks addresses against the USPS's
address system (www.usps.com/ncsc/lookups/lookup_zip+4.html) which will
return a standardized address. If you do that for your whole table you'll
end up being able to compare these standardized addresses against one
another, thereby even matching PO Box 123 vs. P.O. Box 123.
ap
----------------------------------------------------------------------
Andrew J Perrin - andrew_perrin(at)unc(dot)edu - http://www.unc.edu/~aperrin
Assistant Professor of Sociology, U of North Carolina, Chapel Hill
269 Hamilton Hall, CB#3210, Chapel Hill, NC 27599-3210 USA
On Tue, 26 Feb 2002, Matthew Hagerty wrote:
> Greetings,
>
> I have a customer database (name, address1, address2, city, state, zip) and
> I need a query (or two) that will give me a mailing list with the least
> amount of duplicates possible. I know that precise matching is not
> possible, i.e. "P.O. Box 123" will never match "PO Box 123" without some
> data massaging, but if I can isolate even 50% of any duplicates, that would
> help greatly.
>
> Also, any suggestions on which parameters to check the duplicates for? My
> first thoughts were to make sure there were no two addresses the same in
> the same zip code. Any insight (or examples) would be greatly appreciated.
>
> Thank you,
> Matthew
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 1: subscribe and unsubscribe commands go to majordomo(at)postgresql(dot)org
>
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2002-02-26 15:55:41 | Re: Timestamp output |
Previous Message | Matthew Hagerty | 2002-02-26 15:10:12 | Removing duplicates |