Re: Hash Functions

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Joe Conway <mail(at)joeconway(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Jeff Davis <pgsql(at)j-davis(dot)com>, Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>, amul sul <sulamul(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Hash Functions
Date: 2017-08-03 21:09:41
Message-ID: CA+TgmoYRYrBk8+RU1sd=GGirvtSM1-Su1VuB4zsuJy0h-dDH9w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Jun 1, 2017 at 2:25 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> Just to clarify: I don't think it's a problem to do so for integers and
> most other simple scalar types. There's plenty hash algorithms that are
> endianess independent, and the rest is just a bit of care.

Do you have any feeling for which of those endianness-independent hash
functions might be a reasonable choice for us?

https://github.com/markokr/pghashlib implements various hash functions
for PostgreSQL, and claims that, of those implemented, crc32, Jenkins,
lookup3be and lookup3le, md5, and siphash24 are endian-independent.

An interesting point here is that Jeff Davis asserted in the original
post on this thread that our existing hash_any() wasn't portable, but
our current hash_any seems to be the Jenkins algorithm -- so I'm
confused. Part of the problem seems to be that, according to
https://en.wikipedia.org/wiki/Jenkins_hash_function there are 4 of
those. I don't know whether the one in pghashlib is the same one
we've implemented.

Kennel Marshall suggested xxhash as an endian-independent algorithm
upthread. Code for that is available under a 2-clause BSD license.

PostgreSQL page checksums use an algorithm based on, but not exactly,
FNV-1a. See storage/checksum_impl.h. The comments there say this
algorithm was chosen with speed in mind. Our version is not
endian-independent because it folds in 4-byte integers rather than
1-byte integers, but plain old FNV-1a *is* endian-independent and
could be used.

We also have an implementation of CRC32C in core - see port/pg_crc32.h
and port/pg_crc32c_sb8.c. It's not clear to me whether this is
Endian-independent or not, although there is stuff that depends on
WORDS_BIGENDIAN, so, uh, maybe?

Some other possibly-interesting links:
https://research.neustar.biz/2011/12/29/choosing-a-good-hash-function-part-2/
http://greenrobot.org/essentials/features/performant-hash-functions-for-java/comparison-of-hash-functions/
https://www.strchr.com/hash_functions

Thoughts?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Daniel Gustafsson 2017-08-03 21:26:01 Re: Support for Secure Transport SSL library on macOS as OpenSSL alternative
Previous Message Andrew Dunstan 2017-08-03 21:08:44 Re: PL_stashcache, or, what's our minimum Perl version?