From: Greg Stark <gsstark(at)mit(dot)edu>
To: Teodor Sigaev <teodor(at)sigaev(dot)ru>
Cc: Robert Haas <robertmhaas(at)gmail(dot)com>, Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Bloom index
Date: 2010-01-18 02:29:03
Message-ID: 407d949e1001171829l2e79b32av8413e5ba59e63440@mail.gmail.com
Lists: pgsql-hackers
2010/1/13 Teodor Sigaev <teodor(at)sigaev(dot)ru>:
>> This is pretty darn slick. I haven't looked at the code yet, but the
>
> It's just a prototype and/or proof of concept
The only thing that jumps out at me from the code itself is the use of
rand() and srand(), which seems unacceptable. At the very least the
srand(attno) step should be precalculated and stored in the metapage.
That would also make it safe against datatype hash functions
which could call rand() themselves.
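
To illustrate the point, here is a minimal sketch (in Python rather than the prototype's C; the names metapage_seeds and att_bit_positions are hypothetical, not from the patch): the per-attribute seeds are computed once at index build time, so each tuple's signature bits come from a private PRNG state that a datatype hash function touching the global rand() state cannot perturb.

```python
import random

NATTS = 4         # number of indexed columns (illustrative)
SIG_BITS = 64     # signature size in bits (illustrative)
BITS_PER_ATT = 2  # bits set per column value (illustrative)

# Computed once at index build time and stored in the metapage,
# instead of calling srand(attno) for every tuple.
metapage_seeds = [random.Random(attno).getrandbits(32) for attno in range(NATTS)]

def att_bit_positions(attno, hashval):
    """Derive this column's signature bit positions from the stored seed
    and the datatype hash value. No global PRNG state is consulted, so a
    hash function that itself calls rand()/srand() cannot change the result."""
    rng = random.Random(metapage_seeds[attno] ^ hashval)
    return [rng.getrandbits(16) % SIG_BITS for _ in range(BITS_PER_ATT)]

before = att_bit_positions(0, 12345)
random.seed(999)  # simulate a datatype hash function reseeding the global PRNG
after = att_bit_positions(0, 12345)
assert before == after  # index hashing is unaffected
```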
However, this doesn't seem like a very good use of bloom filters to me.
Here your sets are always the same size (the n columns), and the order
is significant. What you're testing is whether the hash value for a
specific column is present in your "set" of hash values for the
columns of that row.
Bloom filters need to be sized to have n bits per set element to
maintain a targeted false positive rate. And that false positive rate
would be higher than just having an n-bit hash for each set element.
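
That claim can be checked numerically. A small sketch under illustrative parameters (not taken from the patch): a Bloom filter holding n elements in m bits with k hash functions has a false positive rate of roughly (1 - e^(-kn/m))^k, while giving each element its own b = m/n bit slot yields exactly 2^(-b) per positional lookup.

```python
import math

n = 16              # set elements (columns), illustrative
bits_per_elem = 8   # bit budget per element, illustrative
m = n * bits_per_elem  # total signature bits, same for both schemes

# Bloom filter at its optimal k = (m/n) * ln 2 hash functions.
k = max(1, round((m / n) * math.log(2)))
bloom_fp = (1 - math.exp(-k * n / m)) ** k

# Truncated per-element hash: b bits tested at a known position.
trunc_fp = 2.0 ** -bits_per_elem

print(f"bloom fp ~ {bloom_fp:.4f}, truncated-hash fp = {trunc_fp:.4f}")
assert trunc_fp < bloom_fp  # same bit budget, lower false positive rate
```

With 8 bits per element the Bloom filter's best case is around 2%, while the truncated hash gives about 0.4%, which is the gap the paragraph above describes.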
Normally they have the advantage that you can test for membership
anywhere in the set quickly -- but in this case you actually only want
to test a specific position in the set anyways.
So I think you can get a lower false positive rate by just truncating
each column's hash value to n bits and storing an array of them.
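
That alternative can be sketched as follows (Python illustration; make_signature, might_match, and the use of crc32 as a stand-in for the datatype hash function are all hypothetical): store one b-bit truncated hash per column, in column order, and at query time compare only the slot for the queried column.

```python
import zlib

B = 8                 # bits kept per column (illustrative)
MASK = (1 << B) - 1

def hash32(value):
    # Stand-in for the per-datatype hash function.
    return zlib.crc32(repr(value).encode())

def make_signature(row):
    """Per-row signature: one b-bit truncated hash per column, stored
    in column order -- the array position identifies the column."""
    return [hash32(v) & MASK for v in row]

def might_match(signature, colno, value):
    """Test only the queried column's slot. False positives occur with
    probability 2**-B; false negatives never occur."""
    return signature[colno] == (hash32(value) & MASK)

sig = make_signature(("alice", 42, "blue"))
assert might_match(sig, 1, 42)       # the true value always matches
assert might_match(sig, 0, "alice")
```

Because the query always names the column being probed, nothing is lost by giving up the Bloom filter's ability to test membership at any position.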
--
greg