Re: Hash indexes

From: "Dann Corbit" <DCorbit(at)connx(dot)com>
To: "Greg Stark" <gsstark(at)mit(dot)edu>, "Hannu Krosing" <hannu(at)skype(dot)net>
Cc: "Andrew Dunstan" <andrew(at)dunslane(dot)net>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Alvaro Herrera" <alvherre(at)commandprompt(dot)com>, "Jim C(dot) Nasby" <jnasby(at)pervasive(dot)com>, "Luke Lonergan" <LLonergan(at)greenplum(dot)com>, <mark(at)mark(dot)mielke(dot)cc>, "Bruce Momjian" <bruce(at)momjian(dot)us>, "Jie Zhang" <jzhang(at)greenplum(dot)com>, "Gavin Sherry" <swm(at)linuxworld(dot)com(dot)au>, <pgsql-hackers(at)postgresql(dot)org>, <mikio(at)users(dot)sourceforge(dot)net'>, "Julienne Walker" <happyfrosty(at)hotmail(dot)com>
Subject: Re: Hash indexes
Date: 2006-08-02 00:45:48
Message-ID: D425483C2C5C9F49B5B7A41F8944154757DADB@postal.corporate.connx.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

> -----Original Message-----
> From: pgsql-hackers-owner(at)postgresql(dot)org [mailto:pgsql-hackers-
> owner(at)postgresql(dot)org] On Behalf Of Greg Stark
> Sent: Tuesday, August 01, 2006 4:42 PM
> To: Hannu Krosing
> Cc: Andrew Dunstan; Gregory Stark; Tom Lane; Alvaro Herrera; Jim C. Nasby;
> Luke Lonergan; mark(at)mark(dot)mielke(dot)cc; Bruce Momjian; Jie Zhang; Gavin
> Sherry; pgsql-hackers(at)postgresql(dot)org
> Subject: Re: [HACKERS] Hash indexes
>
> Hannu Krosing <hannu(at)skype(dot)net> writes:
>
> > Ühel kenal päeval, T, 2006-08-01 kell 10:54, kirjutas Andrew Dunstan:
> > > Gregory Stark wrote:
> > > >
> > > > I looked a while back and was suspicious about the actual hash
> functions too.
> > > > It seemed like a lot of them were vastly suboptimal. That would mean
> we're
> > > > often dealing with mostly empty and mostly full buckets instead of
> well
> > > > distributed hash tables.
> > > >
> > > >
> > > >
> > >
> > > This is now sounding like a lot of low hanging fruit ... highly
> > > performant hash indexed tables could possibly be a very big win.
> > >
> >
> > Are you sure about the badness of our hash functions ?
> >
> > I just tested and hashtext(text) has about 1.4% of collisions on about
> > 120M distinct texts, which is not bad considering thet total space for
> > hashes is 4G, meaning that 120M covers itself already 3% of possible
> > hash space.
>
> I don't recall exactly what triggered my suspicions, but I think it had
> more
> to do with things like how int4 and int8 were mapped to hash codes and
> what
> the hash function looks like that compresses the hash codes into the
> actual
> bucket. IIRC it's just the low order bits for int8 and it's the actual
> int4
> which then only takes the low order bits. I seem to recall wondering what
> would happen if you had, say, a list of /24 network addresses for example.
>
> I didn't actually do any tests though so it's quite possible I missed
> something that mitigated the issues I feared.

There is a program called QDBM:
http://qdbm.sourceforge.net/

That has very effective disk based hashing. A clever extension that he uses is to use btrees to chain hash collisions (very similar to an idea I like to use from time to time which is using a hash table of skiplists). The project is LGPL but it is possible that Mikio Hirabayashi could be talked into letting the on disk hash table portion be made available with a Berkeley style license.

As for effectiveness of hash functions, there are test programs on this page:
http://www.burtleburtle.net/bob/hash/index.html

and this hash library:
http://eternallyconfuzzled.com/libs/jsw_hlib.html
has statistics built into it (you can easily add and test your own hash functions using the provided API).

> --
> greg
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
> choose an index scan if your joining column's datatypes do not
> match

Browse pgsql-hackers by date

  From Date Subject
Next Message Joshua D. Drake 2006-08-02 00:46:58 Re: [HACKERS] Replication Documentation
Previous Message Alvaro Herrera 2006-08-02 00:39:08 Re: [HACKERS] Replication Documentation