Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?

From: Miles Elam <miles(dot)elam(at)productops(dot)com>
To: Erwin Brandstetter <brsaweda(at)gmail(dot)com>
Cc: PG-General Mailing List <pgsql-general(at)postgresql(dot)org>
Subject: Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?
Date: 2019-12-10 22:34:12
Message-ID: CAALojA9PE7BO8E3CPQANAOGoT9nox1sGE=hB3qOfVYYppRTt2Q@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

In terms of "wasted computation", MD5, SHA1, and the others always compute
the full length before they are passed to a UUID, int, or whatever. It's a
sunk cost. It's also a minor cost considering many hash algorithms are
performed in CPU hardware now. All that's left is the truncation and cast,
which you can't avoid easily.

Sure, you could reimplement Java's .hashCode() method by iterating through
the characters and processing the character codes:

s[0]*31^(n - 1) + s[1]*31^(n - 2) + ... + s[n - 1]

I don't see how that would beat the CPU-based hashes though unless you
wrote a C-based extension. Maybe it's better just to embrace the
user-defined function first and then decide if performance is insufficient
for your use cases.

CREATE EXTENSION IF NOT EXISTS pgcrypto;

CREATE OR REPLACE FUNCTION hash8 (p_data text, p_algo text = 'md5') RETURNS
int8 AS $$

SELECT ('x' || encode(substring(digest(p_data, p_algo) FROM 1 FOR 16),
'hex'))::bit(64)::int8

$$ LANGUAGE sql IMMUTABLE PARALLEL SAFE;

CREATE OR REPLACE FUNCTION hash4 (p_data text, p_algo text = 'md5') RETURNS
int4 AS $$

SELECT ('x' || encode(substring(digest(p_data, p_algo) FROM 1 FOR 8),
'hex'))::bit(32)::int4

$$ LANGUAGE sql IMMUTABLE PARALLEL SAFE;

SELECT

hash4('something something something'),

hash4('something something something', 'sha1'),

hash8('something something something'),

hash8('something something something', 'sha1');

Cheers,

Miles

On Tue, Dec 10, 2019 at 1:12 PM Erwin Brandstetter <brsaweda(at)gmail(dot)com>
wrote:

> I am looking for stable hash functions producing 8-byte or 4-byte hashes
> from long text values in Postgres 10 or later.
>
> There is md5(), the result of which can be cast to uuid. This reliably
> produces practically unique, stable 16-byte values. I have usecases where
> an 8-byte or even 4-byte hash would be good enough to make collisions
> reasonably unlikely. (I can recheck on the full string) - and expression
> indexes substantially smaller. I could truncate md5 and cast back and
> forth, but that seems like a lot of wasted computation. Are there
> suggestions for text hash functions that are
> - fast
> - keep collisions to a minimum
> - stable across major Postgres versions (so expression indexes don't break)
> - croptographic aspect is not needed (acceptable, but no benefit)
>
> There is an old post from 2012 by Tom Lane suggesting that hashtext() and
> friends are not for users:
>
> https://www.postgresql.org/message-id/24463.1329854466%40sss.pgh.pa.us
>
> Postgres 11 added hashtextextended() and friends to generate bigint
> hashes. In a more recent post from 3 months ago, Tom suggests to use it in
> user-land - if portability is not needed:
>
> https://www.postgresql.org/message-id/9434.1568839177%40sss.pgh.pa.us
>
> Is pghashlib by Marko Kreen my best option?
>
> https://github.com/markokr/pghashlib
>
> Or the "version-independent hash functions for PostgreSQL" from Peter
> Eisentraut:
>
> https://github.com/petere/pgvihash
>
> Neither received updates for a couple of years. Unmaintained? Or obsolete?
> And neither is available on most hosted services like RDS or Heroku (which
> would be required in come cases).
>
> So what are my best options?
>
> Regards
> Erwin
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message raf 2019-12-10 23:10:36 Re: PostgreSQL vs PostgresQL
Previous Message Laurenz Albe 2019-12-10 22:13:23 Re: Fast, stable, portable hash function producing 4-byte or 8-byte values?