Re: Hash Functions

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Joe Conway <mail(at)joeconway(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Jeff Davis <pgsql(at)j-davis(dot)com>, Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>, amul sul <sulamul(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Hash Functions
Date: 2017-08-03 21:32:35
Message-ID: 20170803213235.x4u3omcxwamtmjxy@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2017-08-03 17:09:41 -0400, Robert Haas wrote:
> On Thu, Jun 1, 2017 at 2:25 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> > Just to clarify: I don't think it's a problem to do so for integers and
> > most other simple scalar types. There's plenty hash algorithms that are
> > endianess independent, and the rest is just a bit of care.
>
> Do you have any feeling for which of those endianness-independent hash
> functions might be a reasonable choice for us?

Not a strong / very informed one, TBH.

I'm not convinced it's worth trying to achieve this in the first place,
now that we "nearly" have the insert-via-parent feature. Isn't this a
lot of work, for very little practical gain? Having to select that when
switching architectures still seems unproblematic. People will just
about never switch endianess, so even a tiny performance & effort
overhead doesn't seem worth it to me.

Leaving that aside:

> https://github.com/markokr/pghashlib implements various hash functions
> for PostgreSQL, and claims that, of those implemented, crc32, Jenkins,
> lookup3be and lookup3le, md5, and siphash24 are endian-independent.

> An interesting point here is that Jeff Davis asserted in the original
> post on this thread that our existing hash_any() wasn't portable, but
> our current hash_any seems to be the Jenkins algorithm -- so I'm
> confused. Part of the problem seems to be that, according to
> https://en.wikipedia.org/wiki/Jenkins_hash_function there are 4 of
> those. I don't know whether the one in pghashlib is the same one
> we've implemented.

IIUC lookup3be/le from Marko's hashlib just has a endianess conversion
added. I'd personally not go for jenkins3, it's not particularly fast,
nor does it balance that out w/ being cryptographicaly secure.

> Kennel Marshall suggested xxhash as an endian-independent algorithm
> upthread. Code for that is available under a 2-clause BSD license.

Yea, that'd have been one of my suggestions too. Especially as I still
want to implement better compression using lz4, and that'll depend on
xxhash in turn.

> PostgreSQL page checksums use an algorithm based on, but not exactly,
> FNV-1a. See storage/checksum_impl.h. The comments there say this
> algorithm was chosen with speed in mind. Our version is not
> endian-independent because it folds in 4-byte integers rather than
> 1-byte integers, but plain old FNV-1a *is* endian-independent and
> could be used.

Non-SIMDed (which we hope to achieve with our implementation, which is
why we have separate compiler flags for that file) implementations of
FNV are, to my knowledge, not particularly fast. And the SIMD tricks
are, to my knowledge, not really applicable to the case at hand here. So
I'm not a fan of choosing FNV.

> We also have an implementation of CRC32C in core - see port/pg_crc32.h
> and port/pg_crc32c_sb8.c. It's not clear to me whether this is
> Endian-independent or not, although there is stuff that depends on
> WORDS_BIGENDIAN, so, uh, maybe?

The results should be endian independent. It depends on WORDS_BIGENDIAN
because we need different pre-computed tables depending on endianess.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2017-08-03 21:37:41 Re: Change in "policy" on dump ordering?
Previous Message Daniel Gustafsson 2017-08-03 21:26:01 Re: Support for Secure Transport SSL library on macOS as OpenSSL alternative