From: | Greg Stark <gsstark(at)mit(dot)edu> |
---|---|
To: | Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org> |
Cc: | Bruno Wolff III <bruno(at)wolff(dot)to>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, <pgsql-performance(at)postgresql(dot)org> |
Subject: | Re: Horribly slow hash join |
Date: | 2004-04-19 05:29:19 |
Message-ID: | 874qrg7cy8.fsf@stark.xeocode.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org> writes:
> On Sun, 18 Apr 2004, Bruno Wolff III wrote:
>
> > Another option would be to put the numbers into two int4s. For int4 or
> > smaller types one of these would be zero. int8s would be split between
> > the two. The hash function would then be defined on the two int4s.
>
> Sure, this is an internal calculation in the hash function. The only
> important thing is that the number 7 (for example) gives the same hash
> value no matter if it is an int2 or an int8 and that the hash function
> works well also for int8 numbers (which is does not today).
What's missing here is that the actual API for hash functions is that the data
type provides a function that hashes to 32 bit integers. Then the hash code
uses the 32 bit integer to crunch down to the actual number of buckets (using
mod).
The choice of 32 bit integers is purely arbitrary. As long as it's larger than
than the number of buckets in any sane hash table it's fine. 32 bits is
plenty.
I question the use of mod to crunch the hash value down though. In the case of
int4 the mapping to 32 bits is simply the identity. So the entire hash
function ends up being simply "input mod #buckets". It seems way too easy to
find real world data sets where many numbers will all be multiples of some
number. If that common divisor shares any factors with the number of buckets,
then the distribution will be very far from even with many empty buckets.
If the hash tables were made a power of two then it would be possible to mix
the bits of the 32 bit value and just mask off the unneeded bits. I've found
one page via google that mentions mixing bits in a hash function, but I would
look for a more serious treatment somewhere.
http://burtleburtle.net/bob/hash/doobs.html
Incidentally, this text claims mod is extremely slow compared to bit
manipulations. I don't know that that kind of cycle counting is really is a
factor for postgres though.
Also, incidentally, this text is interesting:
http://www.isthe.com/chongo/tech/comp/fnv/
--
greg
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2004-04-19 06:09:30 | Re: Horribly slow hash join |
Previous Message | Dennis Bjorklund | 2004-04-19 04:43:16 | Re: Horribly slow hash join |