From: | Merlin Moncure <mmoncure(at)gmail(dot)com> |
---|---|
To: | Robert Klemme <shortcutter(at)googlemail(dot)com> |
Cc: | Stefan Keller <sfkeller(at)gmail(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Peter Geoghegan <peter(at)2ndquadrant(dot)com>, pgsql-performance(at)postgresql(dot)org |
Subject: | Re: Hash index use presently(?) discouraged since 2005: revive or bury it? |
Date: | 2011-09-19 18:43:06 |
Message-ID: | CAHyXU0xmVjP_vgFOpiUWBV=kwAT5CC90=1Pu-CrTtMKZc1f1eg@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-performance |
On Mon, Sep 19, 2011 at 10:19 AM, Robert Klemme
<shortcutter(at)googlemail(dot)com> wrote:
> On Mon, Sep 19, 2011 at 4:04 PM, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>> On Sun, Sep 18, 2011 at 9:59 AM, Stefan Keller <sfkeller(at)gmail(dot)com> wrote:
>>> Merlin and Jeff,
>>>
>>> General remark again:It's hard for me to imagine that btree is
>>> superior for all the issues mentioned before. I still believe in hash
>>> index for primary keys and certain unique constraints where you need
>>> equality search and don't need ordering or range search.
>>
>> It is -- but please understand I'm talking about int32 tree vs hash.
>> Hashing as a technique of course is absolutely going to cream btree
>> for all kinds of data because of the advantages of working with
>> decomposed data -- we are all taught that in comp-sci 101 :-). The
>> debate here is not about the advantages of hashing, but the specific
>> implementation of the hash index used.
>>
>> Postgres's hash index implementation used to be pretty horrible -- it
>> stored the pre-hashed datum in the index which, while making it easier
>> to do certain things, made it horribly slow, and, for all intents and
>> purposes, useless. Somewhat recently,a lot of work was put in to fix
>> that -- the index now packs the hash code only which made it
>> competitive with btree and superior for larger keys. However, certain
>> technical limitations like lack of WAL logging and uniqueness hold
>> hash indexing back from being used like it really should be. In cases
>> where I really *do* need hash indexing, I do it in userland.
>>
>> create table foo
>> (
>> a_long_field text;
>> );
>> create index on foo(hash(a_long_field));
>>
>> select * from foo where hash(a_long_field) = hash(some_value) and
>> a_long_field = some_value;
>>
>> This technique works fine -- the main disadvantage is that enforcing
>> uniqueness is a PITA but since the standard index doesn't support it
>> either it's no great loss. I also have the option of getting
>> 'uniqueness' and being able to skip the equality operation if I
>> sacrifice some performance and choose a strong digest. Until the hash
>> index issues are worked out, I submit that this remains the go-to
>> method to do this.
>
> Is this approach (storing the hash code in a btree) really faster than
> a regular btree index on "a_long_field"? And if so, for which kind of
> data and load?
>
>> Now, my point here is that I've noticed that even with the latest
>> optimizations btree seems to still be superior to the hash indexing by
>> most metrics, so that:
>> create table foo
>> (
>> an_int_field int;
>> a_long_field text;
>> );
>>
>> create index on foo(an_int_field);
>> create index on foo using hash(a_long_field);
>>
>> On performance grounds alone, the btree index seems to be (from my
>> very limited testing) a better bet. So I'm conjecturing that the
>> current hash implementation should be replaced with a formalization of
>> the userland technique shown above -- when you request a hash index,
>> the database will silently hash your field and weave it into a btree.
>> It's a hybrid: a hashed btree.
>
> I'd rather call it a "btreefied hash" because you are storing a hash
> but in a btree structure. :-) But that's a detail. What I find
> worrying is that then there is a certain level of obscuring the real
> nature since "create index ... using hash" is not exactly creating a
> hash table.
>
>> To truly demonstrate if the technique
>> was effective though, it would have to be coded up -- it's only fair
>> to compare if the btree based hash is also double checking the value
>> in the heap which the standard hash index must do.
>
> Right.
so, i was curious, and decided to do some more performance testing. I
created a table like this:
create table foo as select r, r::text || 'acbdefghijklmnop' as b from
generate_series(1,10000000) r;
create index on foo(r);
create index on foo using hash(b);
to simulate the btree/hash hybrid, I cut a pgbench file like so (btree.sql):
\setrandom x 1 100000
select * from foo where r = :x and b=:x::text || 'acbdefghijklmnop'
and this: for the standard hash (hash.sql):
\setrandom x 1 100000
select * from foo where b=:x::text || 'acbdefghijklmnop'
pgbench -n -c2 -T 10 -f hash.sql etc
On my test machine, hybrid hash eeks out a slight win on in-cache tests:
merlin(at)mmoncure-ubuntu:~$ pgbench -n -c2 -T 100 -f btree.sql -p 5490
tps = 3250.793656 (excluding connections establishing)
vs
merlin(at)mmoncure-ubuntu:~$ pgbench -n -c2 -T 100 -f hash.sql -p 5490
tps = 3081.730400 (excluding connections establishing)
To make the test into i/o bound, I change the setrandom from 100000 to
10000000; this produced some unexpected results. The hash index is
pulling about double the tps (~80 vs ~ 40) over the hybrid version.
Well, unless my methodology is wrong, it's unfair to claim btree is
beating hash in 'all cases'. hm.
merlin
From | Date | Subject | |
---|---|---|---|
Next Message | Claudio Freire | 2011-09-19 18:53:07 | Re: Hash index use presently(?) discouraged since 2005: revive or bury it? |
Previous Message | Jeff Janes | 2011-09-19 16:15:45 | Re: Hash index use presently(?) discouraged since 2005: revive or bury it? |