From: Ants Aasma <ants(at)cybertec(dot)at>
To: Atri Sharma <atri(dot)jiit(at)gmail(dot)com>
Cc: Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bloom Filter lookup for hash joins
Date: 2013-06-26 11:08:20
Message-ID: CA+CSw_uYOjCpDzdR75TZkk-ZMMKuWoKbcsx93-MswKDgkN-djA@mail.gmail.com
Lists: pgsql-hackers
On Wed, Jun 26, 2013 at 9:46 AM, Atri Sharma <atri(dot)jiit(at)gmail(dot)com> wrote:
> I have been researching bloom filters and discussed it on IRC with
> RhodiumToad and David Fetter, and they pointed me to the various
> places that could potentially have bloom filters, apart from the
> places that already have them currently.
One idea I had was to use them to do CLOG lookups from smaller
data structures. You could store the list of aborted xids in Bloom
filters. When an xid is not found in the filter, it is known to
have committed; if the filter returns true, you have to check the
real CLOG. At a 99% commit rate this would achieve, for example, a
1:8 compression ratio with a 1:160,000 false positive rate, or a 1:16
compression ratio with a 1:400 false positive rate. Nothing too
exciting, so I didn't develop the idea any further.
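The scheme above can be sketched roughly as follows. This is purely
illustrative Python, not PostgreSQL code; the names BloomFilter and
xid_status are made up for the example, and the hashing choice (slicing a
SHA-256 digest) is just one convenient way to get k independent-ish hash
functions:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: m bits, k hash positions sliced from SHA-256."""

    def __init__(self, m_bits, k_hashes):
        assert k_hashes <= 8          # 32-byte digest yields 8 4-byte slices
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray((m_bits + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.k):
            chunk = digest[4 * i:4 * i + 4]   # 4 bytes per hash function
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

def xid_status(xid, aborted_filter, real_clog):
    """A negative filter answer proves the xid committed (no false
    negatives); a positive answer may be a false positive, so fall
    back to the authoritative CLOG."""
    if not aborted_filter.might_contain(xid):
        return "committed"
    return real_clog[xid]
```

For reference, the quoted ratios are consistent with the standard formula
p ≈ 0.6185^(m/n): CLOG stores two status bits per xid, so 1:8 compression
leaves 0.25 filter bits per xid, which at a 99% commit rate is roughly 25
bits per stored (aborted) xid, giving p ≈ 0.6185^25 ≈ 1:160,000; halving
the space again gives ≈ 0.6185^12.5 ≈ 1:400.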
> I have been reading the current implementation of hash joins, and in
> ExecScanHashBucket, which I understand is the actual lookup function,
> we could potentially look at a bloom filter per bucket. Instead of
> actually looking up each hash value for the outer relation, we could
> just check the corresponding bloom filter for that bucket, and if we
> get a positive, then lookup the actual values i.e. continue with our
> current behaviour (since we could be looking at a false positive).
The problem here is that if the hash table is in memory, a direct hash
table lookup is likely to be cheaper than a Bloom filter lookup, even if
the Bloom filter fits into the processor cache and the hash table
doesn't (ten last-level cache hits can be slower than one cache miss).
A Bloom filter helps when it's not feasible to use an actual hash table
(lots of large keys), the real lookup is slow (e.g. an index lookup),
you are doing enough lookups to amortize the construction cost, and the
condition is expected to be selective (most lookups give a negative).
There might be some data warehouse type queries where it would help, but
it seems like an awfully narrow use case, with a potential for
performance regressions when the planner has a bad estimate.
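To make the tradeoff concrete, here is a sketch of the only regime where
the filter pays off: a selective probe against an expensive lookup, where
most keys are rejected before the real lookup is ever paid for. Again
this is illustrative Python, not PostgreSQL code; TinyBloom and
probe_with_filter are invented names, and the two-probe integer hashing
is a simplification:

```python
class TinyBloom:
    """Tiny Bloom filter over integer keys, two hash probes per key."""

    def __init__(self, m_bits):
        self.m = m_bits
        self.bits = [False] * m_bits

    def _probes(self, key):
        h1 = key % self.m
        h2 = (key * 2654435761 + 1) % self.m   # Knuth multiplicative hash
        return (h1, h2)

    def add(self, key):
        for p in self._probes(key):
            self.bits[p] = True

    def might_contain(self, key):
        return all(self.bits[p] for p in self._probes(key))

def probe_with_filter(probe_keys, bloom, expensive_lookup):
    """Pre-screen probe keys with the filter; pay for the expensive
    lookup (standing in for e.g. an index probe) only on positives."""
    results = []
    skipped = 0
    for key in probe_keys:
        if not bloom.might_contain(key):
            skipped += 1            # definite miss: real lookup avoided
            continue
        row = expensive_lookup(key)  # may still miss (false positive)
        if row is not None:
            results.append((key, row))
    return results, skipped
```

If the "expensive" lookup is just an in-memory hash probe, the filter
check is pure overhead on every hit, which is the point made above: the
win exists only when the avoided lookup costs much more than the filter
probe and most probes are negative.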
Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de