From: | Robert Haas <robertmhaas(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Joe Conway <mail(at)joeconway(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>, Jeff Davis <pgsql(at)j-davis(dot)com>, Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>, amul sul <sulamul(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Subject: | Re: Hash Functions |
Date: | 2017-08-03 21:09:41 |
Message-ID: | CA+TgmoYRYrBk8+RU1sd=GGirvtSM1-Su1VuB4zsuJy0h-dDH9w@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Jun 1, 2017 at 2:25 PM, Andres Freund <andres(at)anarazel(dot)de> wrote:
> Just to clarify: I don't think it's a problem to do so for integers and
> most other simple scalar types. There's plenty hash algorithms that are
> endianess independent, and the rest is just a bit of care.
Do you have any feeling for which of those endianness-independent hash
functions might be a reasonable choice for us?
https://github.com/markokr/pghashlib implements various hash functions
for PostgreSQL, and claims that, of those implemented, crc32, Jenkins,
lookup3be and lookup3le, md5, and siphash24 are endian-independent.
An interesting point here is that Jeff Davis asserted in the original
post on this thread that our existing hash_any() wasn't portable, but
our current hash_any seems to be the Jenkins algorithm -- so I'm
confused. Part of the problem seems to be that, according to
https://en.wikipedia.org/wiki/Jenkins_hash_function there are 4 of
those. I don't know whether the one in pghashlib is the same one
we've implemented.
Kennel Marshall suggested xxhash as an endian-independent algorithm
upthread. Code for that is available under a 2-clause BSD license.
PostgreSQL page checksums use an algorithm based on, but not exactly,
FNV-1a. See storage/checksum_impl.h. The comments there say this
algorithm was chosen with speed in mind. Our version is not
endian-independent because it folds in 4-byte integers rather than
1-byte integers, but plain old FNV-1a *is* endian-independent and
could be used.
We also have an implementation of CRC32C in core - see port/pg_crc32.h
and port/pg_crc32c_sb8.c. It's not clear to me whether this is
Endian-independent or not, although there is stuff that depends on
WORDS_BIGENDIAN, so, uh, maybe?
Some other possibly-interesting links:
https://research.neustar.biz/2011/12/29/choosing-a-good-hash-function-part-2/
http://greenrobot.org/essentials/features/performant-hash-functions-for-java/comparison-of-hash-functions/
https://www.strchr.com/hash_functions
Thoughts?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From | Date | Subject | |
---|---|---|---|
Next Message | Daniel Gustafsson | 2017-08-03 21:26:01 | Re: Support for Secure Transport SSL library on macOS as OpenSSL alternative |
Previous Message | Andrew Dunstan | 2017-08-03 21:08:44 | Re: PL_stashcache, or, what's our minimum Perl version? |