From: | Peter Geoghegan <pg(at)bowt(dot)ie> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: [HACKERS] A design for amcheck heapam verification |
Date: | 2017-12-07 19:37:23 |
Message-ID: | CAH2-Wznm5ZOjS0_DJoWrcm9Us19gzbkm0aTKt5hHprvjHFVHpQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, Nov 28, 2017 at 9:54 PM, Peter Geoghegan <pg(at)bowt(dot)ie> wrote:
> On Tue, Nov 28, 2017 at 9:50 PM, Michael Paquier
> <michael(dot)paquier(at)gmail(dot)com> wrote:
>>> Would that address your concern? There would be an SQL interface, but
>>> it would be trivial.
>>
>> That's exactly what I think you should do, and mentioned so upthread.
>> A SQL interface can also show a good example of how developers can use
>> this API.
Attach revision, v5, adds a new test harness -- test_bloomfilter.
This can be used to experimentally verify that the meets the well
known "1% false positive rate with 9.6 bits per element" standard. It
manages to do exactly that:
postgres=# set client_min_messages = 'debug1';
SET
postgres=# SELECT test_bloomfilter(power => 23, nelements => 873813,
seed => -1, tests => 3);
DEBUG: beginning test #1...
DEBUG: bloom_work_mem (KB): 1024
DEBUG: false positives: 8630 (rate: 0.009876, proportion bits set:
0.517625, seed: 1373191603)
DEBUG: beginning test #2...
DEBUG: bloom_work_mem (KB): 1024
DEBUG: false positives: 8623 (rate: 0.009868, proportion bits set:
0.517623, seed: 406665822)
DEBUG: beginning test #3...
DEBUG: bloom_work_mem (KB): 1024
WARNING: false positives: 8840 (rate: 0.010117, proportion bits set:
0.517748, seed: 398116374)
test_bloomfilter
------------------
(1 row)
Here, we repeat the same test 3 times, varying only the seed value
used for each run.
The last message is a WARNING because we exceed the 1% threshold
(hard-coded into test_bloomfilter.c), though only by a tiny margin,
due only to random variations in seed value. We round up to 10 bits
per element for the regression tests. That's where the *actual*
"nelements" argument comes from within the tests, so pg_regress tests
should never see the WARNING (if they do, that counts as a failure).
I've experimentally observed that we get the 1% false positive rate
with any possible bitset when "nelements" works out at 9.6 bitset bits
per element. Inter-run variation is tiny. With 50 tests, I didn't
observe these same Bloom filter parameters produce a false positive
rate that came near to 1.1%. 1.01% or 1.02% was about as bad as it
got.
There is a fairly extensive README, which I hope will clear up the
theory behind the bloomfilter.c strategy on bitset size and false
positives. Also, there was a regression that I had to fix in
bloomfilter.c, in seeding. It didn't reliably cause variation in the
false positives. And, there was bitrot with the documentation that I
fixed up.
--
Peter Geoghegan
Attachment | Content-Type | Size |
---|---|---|
0001-Add-Bloom-filter-data-structure-implementation.patch | text/x-patch | 26.4 KB |
0002-Add-amcheck-verification-of-indexes-against-heap.patch | text/x-patch | 38.0 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Joshua D. Drake | 2017-12-07 19:38:51 | Re: Logical replication without a Primary Key |
Previous Message | Peter Eisentraut | 2017-12-07 19:35:04 | Re: Transform for pl/perl |