Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?

From: George Neuner <gneuner2(at)comcast(dot)net>
To: pgsql-general(at)postgresql(dot)org
Subject: Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?
Date: 2019-12-15 21:59:30
Message-ID: rv8dveln3t9l4st5su4q2p9b8fir2l88bt@4ax.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Tue, 10 Dec 2019 18:00:02 -0600, Ron <ronljohnsonjr(at)gmail(dot)com>
wrote:

>On 12/10/19 3:11 PM, Erwin Brandstetter wrote:
>> I am looking for stable hash functions producing 8-byte or 4-byte hashes
>> from long text values in Postgres 10 or later.
>>
>> There is md5(), the result of which can be cast to uuid. This reliably
>> produces practically unique, stable 16-byte values. I have usecases where
>> an 8-byte or even 4-byte hash would be good enough to make collisions
>> reasonably unlikely. (I can recheck on the full string) - and expression
>> indexes substantially smaller. I could truncate md5 and cast back and
>> forth, but that seems like a lot of wasted computation. Are there
>> suggestions for text hash functions that are
>> - fast
>> - keep collisions to a minimum
>> - stable across major Postgres versions (so expression indexes don't break)
>> - croptographic aspect is not needed (acceptable, but no benefit)
>
>What about a CRC32 function?  It's fast, and an SSE4 instruction has been in
>Intel CPUs for about 10 years.

On long text CRC will not be as discriminating as a real cryptohash,
but it may be considerably faster to compute depending on the length
of the text. Given that the OP can check the actual string in the
event of a collision, it may work well enough.

One way to make it more discriminating is to compute a pair of CRCs
using different polynomials. Unfortunately the SSE4.2 CRC32
instruction uses a fixed polynomial. You can compute it twice using
different initial values, but the result won't be as good as actually
using different polynomials.

I used to create 4-byte hashes by concatenating two 16-bit CRCs - one
using the X-modem polynomial and the other using the CCITT polynomial.
Since these gave very different results, it worked quite well for my
use case where all the strings were guaranteed < 16KB.

YMMV,
George

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Erik Aronesty 2019-12-15 22:17:15 Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?
Previous Message Adrian Klaver 2019-12-15 16:32:31 Re: Is there any tool that provides Physical Backup plus PITR for a single database ( Not the whole PG instance ) ?