From: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
---|---|
To: | Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Combining hash values |
Date: | 2016-07-31 01:15:55 |
Message-ID: | CAEepm=3rdgjfxW4cKvJ0OEmya2-34B0qHNG1xV0vK7TGPJGMUQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi hackers,
Andres Freund asked me off list what I thought about this part of
execGrouping.c, which builds a hash from the hashes of several datums
in a loop:
/* rotate hashkey left 1 bit at each step */
hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
...
hashkey ^= hkey;
I think we should consider improving it and putting it somewhere central.
1. The same code also appears in nodeHash.c, and we also need to do
the same thing in a couple of new projects that my EDB colleagues and
I are working on for proposal as 10.0 features, based on DSM-backed
hash tables. So I think it would make sense to define a hash_combine
function or macro to be reused by all of such places rather than
repeating it everywhere.
2. I suspect that this algorithm for combining hashes is weak, and
could amplify weaknesses in the hash functions feeding it. Compare
Boost's hash_combine, which does some more work before XORing:
seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
That constant approximates the golden ratio (as a fraction of the 32
bit hash space), and it also appears in our hash_any and hash_uint32
functions. I don't claim to understand the mathematics but I think
this may have to do with TACP section 6, theorem S and exercises 8 and
9, though I'm not sure if it's being used for the same purpose in
algorithms that add it (is it just some random bits? [1][2]) and
algorithms that multiply by it [3][4]. Perhaps we could write a
reusable hash_combine function/macro using existing code from
hashfunc.c, or use the formula from Boost, or something else, to
improve the quality of our hash combinations. It would be very
interesting to hear from someone with expertise in this area.
Hash indexes don't currently support multiple column keys. If they
ever do in future, they would also benefit from high quality
combination. Assuming we'd use the same algorithm there too, changing
the combination algorithm after we start persisting the results to
disk in hash indexes would obviously be difficult. There doesn't seem
to be any reason why we can't change the algorithm before then.
[1] http://stackoverflow.com/questions/4948780/magic-number-in-boosthash-combine
[2] https://xkcd.com/221/
[3] http://cstheory.stackexchange.com/questions/9639/how-did-knuth-derive-a
[4] http://community.haskell.org/~simonmar/base/src/Data-HashTable.html
--
Thomas Munro
http://www.enterprisedb.com
From | Date | Subject | |
---|---|---|---|
Next Message | Joshua D. Drake | 2016-07-31 05:17:17 | Re: [BUGS] BUG #14244: wrong suffix for pg_size_pretty() |
Previous Message | Peter Geoghegan | 2016-07-30 22:45:30 | Hash indexes and effective_cache_size in CREATE INDEX documentation |