Re: Bloom Indexes - bit array length and the total number of bits (or hash functions ?? ) !

From: Fabien COELHO <coelho(at)cri(dot)ensmp(dot)fr>
To: Avinash Kumar <avinash(dot)vallarapu(at)gmail(dot)com>
Cc: PostgreSQL Developers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Bloom Indexes - bit array length and the total number of bits (or hash functions ?? ) !
Date: 2019-06-10 06:24:28
Message-ID: alpine.DEB.2.21.1906100758230.9922@lancre
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


> But the 2 direct questions i have are :
>
> 1. What is the structure of the Bloom Index ? Can you please let me know
> what are the fields of a Bloom Index ? Is it just the Item Pointer
> and BloomSignatureWord ?

I'm not sure of Postgres actual implementation, I have just looked at the
underlying hash functions implementation.

> When i describe my bloom index it looks like following.
>
> postgres=# \d+ foo.idx_bloom_bar
> Index "foo.idx_bloom_bar"
> Column | Type | Key? | Definition | Storage | Stats target
> ---------+---------+------+------------+---------+--------------
> id | integer | yes | id | plain |
> dept | integer | yes | dept | plain |
> id2 | integer | yes | id2 | plain |
> id3 | integer | yes | id3 | plain |
> id4 | integer | yes | id4 | plain |
> id5 | integer | yes | id5 | plain |
> id6 | integer | yes | id6 | plain |
> zipcode | integer | yes | zipcode | plain |
> bloom, for table "foo.bar"

The bloom index associates a signature, i.e. a bitfield the size of which
is specified by the parameter "length", to the tuple location. The
bitfield is computed by hashing the value of columns which are listed
above in the index definition. The many values are somehow compressed into
the small signature.

> Options: length=64, col1=4, col2=4, col3=4, col4=4, col5=4, col6=4, col7=4,
> col8=4

I doubt that these parameters are good. The is no point to include a
unique column into a bloom index. If you look at my blog, the number of
bits associated to each field should depend on the expected selectivity,
which depends on the entropy of each field and the signature size. The
column entropy can be computed with a query.

> 2. If we are storing just one signature word per row, how is this
> aggregated for all column values of that row into one signature in high
> level ?

The aggregation, if performed, is not very useful in practice because it
can only be effective on a few (first) bits, which are randomly computed
anyway and the value of the query is not likely to hit them.

Fundamentally all bitfields are scanned to extract which tuples are
possibly of interest, i.e. are not excluded by the index. The "full scan"
of the bloom index is not a bad thing if it is much smaller than the table
itself.

> For example, if length = 64, does it mean that a bit array of 64 bits is
> generated per each row ?

Yes.

> If col1=4, does it mean the value of col1 is passed to 4 hash functions
> that generate 4 positions that can be set to 1 in the bit array ?

Yes.

Try to apply the formula in the blog to see what you get for your
parameters, but it is likely to be < 4.

--
Fabien.

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Alex 2019-06-10 06:45:25 Why to index a "Recently DEAD" tuple when creating index
Previous Message Zhenghua Lyu 2019-06-10 06:00:57 Questions of 'for update'