Quick Links

Re: Horribly slow hash join

From:	Greg Stark <gsstark(at)mit(dot)edu>
To:	Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org>
Cc:	Bruno Wolff III <bruno(at)wolff(dot)to>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Jim C(dot) Nasby" <jim(at)nasby(dot)net>, <pgsql-performance(at)postgresql(dot)org>
Subject:	Re: Horribly slow hash join
Date:	2004-04-19 05:29:19
Message-ID:	874qrg7cy8.fsf@stark.xeocode.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-performance

Dennis Bjorklund <db(at)zigo(dot)dhs(dot)org> writes:

> On Sun, 18 Apr 2004, Bruno Wolff III wrote:
>
> > Another option would be to put the numbers into two int4s. For int4 or
> > smaller types one of these would be zero. int8s would be split between
> > the two. The hash function would then be defined on the two int4s.
>
> Sure, this is an internal calculation in the hash function. The only
> important thing is that the number 7 (for example) gives the same hash
> value no matter if it is an int2 or an int8 and that the hash function
> works well also for int8 numbers (which is does not today).

What's missing here is that the actual API for hash functions is that the data
type provides a function that hashes to 32 bit integers. Then the hash code
uses the 32 bit integer to crunch down to the actual number of buckets (using
mod).

The choice of 32 bit integers is purely arbitrary. As long as it's larger than
than the number of buckets in any sane hash table it's fine. 32 bits is
plenty.

I question the use of mod to crunch the hash value down though. In the case of
int4 the mapping to 32 bits is simply the identity. So the entire hash
function ends up being simply "input mod #buckets". It seems way too easy to
find real world data sets where many numbers will all be multiples of some
number. If that common divisor shares any factors with the number of buckets,
then the distribution will be very far from even with many empty buckets.

If the hash tables were made a power of two then it would be possible to mix
the bits of the 32 bit value and just mask off the unneeded bits. I've found
one page via google that mentions mixing bits in a hash function, but I would
look for a more serious treatment somewhere.

http://burtleburtle.net/bob/hash/doobs.html

Incidentally, this text claims mod is extremely slow compared to bit
manipulations. I don't know that that kind of cycle counting is really is a
factor for postgres though.

Also, incidentally, this text is interesting:

http://www.isthe.com/chongo/tech/comp/fnv/

--
greg

In response to

Re: Horribly slow hash join at 2004-04-19 04:43:16 from Dennis Bjorklund

Responses

Re: Horribly slow hash join at 2004-04-19 06:09:30 from Tom Lane

Browse pgsql-performance by date

	From	Date	Subject
Next Message	Tom Lane	2004-04-19 06:09:30	Re: Horribly slow hash join
Previous Message	Dennis Bjorklund	2004-04-19 04:43:16	Re: Horribly slow hash join