Re: Bloom Filter lookup for hash joins

From: Greg Stark <stark(at)mit(dot)edu>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Ants Aasma <ants(at)cybertec(dot)at>, Atri Sharma <atri(dot)jiit(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bloom Filter lookup for hash joins
Date: 2013-07-23 14:02:31
Message-ID: CAM-w4HMwxHxqFn7Oc9LT4WSGYT4uOeLuYb28GD5OeeVZQVz0QQ@mail.gmail.com
Lists: pgsql-hackers

This exact idea was discussed a while back. I think it was even
implemented.

The problem Tom raised at the time is that the memory usage of the bloom
filter implies a smaller or less efficient hash table. It's difficult to
determine whether you're coming out ahead or behind.

I think it should be possible to figure this out, though. Bloom filters have
well-understood math for the error rate given the size and number of hash
functions (please read up on it and implement the optimal combination for
the target error rate, not just a WAG), and it should be possible to
determine the resulting advantage.
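
For reference, here's a quick sketch of the standard sizing math (plain C,
not taken from the old patch; the function name and example figures are
mine and purely illustrative):

#include <math.h>
#include <stdio.h>

/*
 * Standard Bloom filter sizing.  For n keys and a target
 * false-positive rate p:
 *
 *   bits needed:     m  = -n * ln(p) / (ln 2)^2
 *   hash functions:  k  = (m / n) * ln 2
 *   achieved rate:   p' = (1 - e^(-k*n/m))^k
 */
static void
bloom_sizing(double n, double p)
{
    double  m = ceil(-n * log(p) / (M_LN2 * M_LN2));
    int     k = (int) round((m / n) * M_LN2);
    double  actual = pow(1.0 - exp(-k * n / m), k);

    printf("n=%.0f target=%g -> %.0f bits (%.1f kB), k=%d, actual=%g\n",
           n, p, m, m / 8192.0, k, actual);
}

int
main(void)
{
    /* e.g. a 1M-row inner relation at a 1% target rate needs
     * roughly 9.6 bits per key and k = 7 hash functions. */
    bloom_sizing(1e6, 0.01);
    bloom_sizing(1e6, 0.001);
    return 0;
}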

Measuring the cost of the memory usage is harder, but some empirical results
should give some idea. I expect the result to be wildly uneven one way or
the other, so hopefully it doesn't matter if it's not perfect. If it's close,
then it's probably not worth doing anyway.

I would suggest looking up the archives of the previous discussion. You
might find the patch still usable. IIRC there have been no major changes to
the hash join code.

--
greg
On Jun 26, 2013 7:31 PM, "Jeff Janes" <jeff(dot)janes(at)gmail(dot)com> wrote:

> On Wed, Jun 26, 2013 at 5:01 AM, Atri Sharma <atri(dot)jiit(at)gmail(dot)com> wrote:
>
>>
>> > The problem here is that if the hash table is in memory, doing a hash
>> > table lookup directly is likely to be cheaper than a bloom filter
>> > lookup, even if the bloom filter fits into the processor cache and the
>> > hash table doesn't (10 last-level cache hits are slower than one cache
>> > miss). A Bloom filter will help when it's not feasible to use an actual
>> > hash table (lots of large keys), the real lookup is slow (e.g. an
>> > index lookup), you are doing a lot of lookups to amortize the
>> > construction cost, and the condition is expected to be selective (most
>> > lookups give a negative). There might be some data warehouse types of
>> > queries where it would help, but it seems like an awfully narrow use
>> > case with a potential for performance regressions when the planner has
>> > a bad estimate.
>>
>> OK, sounds good. Can't we use bloom filters for the case where the hash
>> table doesn't fit in memory? Specifically, when reading from disk is
>> inevitable for accessing the hash table, we can use bloom filters to
>> decide which keys to actually read from disk.
>
>
> I don't think that sounds all that promising. When the hash table does
> not fit in memory, it is either partitioned into multiple passes, each of
> which does fit in memory, or a different plan is chosen altogether. Do we
> know under what conditions a Bloom filter would be superior to those
> options, and could we reliably detect those conditions?
>
> Cheers,
>
> Jeff
>
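
To make the break-even condition concrete, here is a rough per-probe model
of the tradeoff Ants describes above (all costs hypothetical; this ignores
filter construction cost and the memory the filter steals from the hash
table):

#include <stdbool.h>

/*
 * s           = match selectivity (fraction of probes that find a match)
 * p           = filter false-positive rate
 * filter_cost = cost of one Bloom filter probe
 * lookup_cost = cost of the real lookup the filter can short-circuit
 *
 * Every probe pays filter_cost.  Of the (1 - s) probes with no match,
 * the filter correctly rejects a (1 - p) fraction, saving lookup_cost.
 */
static bool
bloom_filter_pays_off(double s, double p,
                      double filter_cost, double lookup_cost)
{
    return (1.0 - s) * (1.0 - p) * lookup_cost > filter_cost;
}

This matches the intuition above: the filter only wins when most lookups
are negative (small s) and the real lookup is expensive relative to a
filter probe.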
