From: | "Mason Hale" <masonhale(at)gmail(dot)com> |
---|---|
To: | pgsql-general <pgsql-general(at)postgresql(dot)org> |
Subject: | Re: Using hashtext and unique constraints together |
Date: | 2007-12-12 00:00:16 |
Message-ID: | 8bca3aa10712111600w70fe53bfg202ae2d55e7818cd@mail.gmail.com |
Lists: pgsql-general
> I recently discovered the hashtext() function, and I'm interested in using
> it to reduce the size of one of the biggest indexes in my database.
>
...
> The problem with either MD5 or hashtext() is that neither can guarantee
> unique output even if the inputs are all unique.
>
...
>
> The problem I need help with is guaranteeing uniqueness of the URL on
> insert, with a non-unique index on hashtext(url) and *without* creating a
> unique index on page(url).
>
> I'm thinking that an insert trigger that ensures (SELECT count(*) FROM
> page WHERE hashtext(url) = hashtext('http://www.postgresql.org')
> AND url = 'http://www.postgresql.org') = 0 won't work given MVCC, as two
> transactions could simultaneously insert the same url at the same time.
>
> Can anyone think of a way to guarantee uniqueness of the URL without
> adding a unique constraint on the page(url) column?
>
I gather from the lack of response that the answer is "no"?
Sorry to respond to my own post... but I had an idea -- what if we hash the
same URL two different ways, and create a unique compound index on both
hashes?
For example:

CREATE UNIQUE INDEX page_url_hash_idx ON page (hashtext(url), hashtext(md5(url)));
The idea here is that the odds of two different strings hashing to the same
value with *both* hashtext() and md5() are remote enough to rely on for
uniqueness. I'm converting the md5 to hashtext to save space.
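For lookups, a query along these lines should let the planner use the
compound expression index, with a final equality check on url to weed out
the (astronomically unlikely) case of a double hash collision -- just a
sketch, using an example URL:

SELECT *
  FROM page
 WHERE hashtext(url) = hashtext('http://www.postgresql.org')
   AND hashtext(md5(url)) = hashtext(md5('http://www.postgresql.org'))
   AND url = 'http://www.postgresql.org';  -- final check against collisions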
or even:

CREATE UNIQUE INDEX page_url_salted_idx ON page (hashtext(url), hashtext(url || 'SALT'));
This alternative approach assumes that two different strings that hash to
the same value will not *also* hash to the same value when concatenated with
some additional (constant) data. Again, this doesn't absolutely guarantee
uniqueness, but it seems like it should be safe enough. I guess it depends
on how truly random the hashtext() output is.
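Before trusting either constraint, a quick sanity check over existing data
(just a sketch, assuming the page table is already populated) would be to
look for distinct URLs that already collide on both hashes:

SELECT hashtext(url), hashtext(url || 'SALT'), count(DISTINCT url)
  FROM page
 GROUP BY 1, 2
HAVING count(DISTINCT url) > 1;  -- any rows here are real double collisions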
Both of the above would require 8 bytes for each index entry (plus
overhead), which is still a big savings over our current implementation.
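hashtext() returns a plain int4, so each entry is just two 4-byte integers
side by side; pg_column_size() can confirm that (quick sketch):

SELECT pg_column_size(hashtext('http://www.postgresql.org'));  -- 4 bytes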
Any thoughts on these ideas?
thanks in advance,
Mason