From: | Eric Walstad <eric(at)ericwalstad(dot)com> |
---|---|
To: | sfpug(at)postgresql(dot)org |
Subject: | SQL assistance, please... |
Date: | 2006-02-15 03:59:43 |
Message-ID: | 200602141959.43785.eric@ericwalstad.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | sfpug |
Hello everyone,
My addresses table includes a hash of the address, which I use to speed
searches. The searches are working well, but occasionally I find that there
are hash collisions like the following:
select distinct addr1, address_hash from addresses where address_hash =
'-1394669317' order by addr1;
addr1 | address_hash
-----------------------+--------------
123 SOME STREET | -1394669317
543 ANOTHER STREET | -1394669317
That is, totally different addresses happen to result in the same hash value.
I'd like to find out how often this is happening so I can see if I need to
change hash algorithms.
My question is: how can I find all such instances like this in my table?
I'd like something like:
where a.addr1 != b.addr1 and a.address_hash = b.address_hash
but I'm not sure how to join in the table twice in such a way that I don't get
a ton of duplicate records. Also, I'm not sure what this pattern is called,
so I don't know what to google for.
Any pointers are greatly apprceiated. Many thanks in advance,
Eric.
additional info...
The table looks like this:
uv2=# \d addresses
Column | Type | Modifiers
-----------------+--------------------------+----------------------------------------------------------------------
id | integer | not null default
nextval('public.addresses_id_seq'::text)
addr1 | character varying(255) | not null
addr2 | character varying(255) |
city | character varying(255) |
state | character varying(2) | not null
zipcode | character varying(5) |
zip4 | character varying(4) |
address_hash | character varying(128) |
The address hash is generated from a concatenation of addr1 - zip4 fields
(normalized: case insensitive, no punctuation, common replacements (street ->
st), etc.). Ideally, I'd like to find cases where any of the address fields
differ but the address_hash is the same.
From | Date | Subject | |
---|---|---|---|
Next Message | Jeff Frost | 2006-02-15 04:45:34 | Re: SQL assistance, please... |
Previous Message | Eric Walstad | 2006-02-14 21:31:42 | Re: invalid byte sequence for encoding "UNICODE": 0xd9 |