Extending array intersection ops to bloom indexes

From: Lauri Svan <lauri(dot)svan(at)rescomms(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: Extending array intersection ops to bloom indexes
Date: 2020-09-16 11:34:26
Message-ID: CAAsqVvWbz0n7JYMNRJCc_-10wFsVSmWz+Ws0L5gS99kFedz75w@mail.gmail.com
Lists: pgsql-hackers

The current Bloom index implementation only supports the = operator, but the
documentation states that it could be extended to handle array intersection
or union queries in the future. As the original authors hang out on this
list, I would like to discuss what such an implementation might look like.

I have just started investigating, and most of what I am writing here is
likely faulty. What I am currently pondering is whether it would be
sufficient to OR the hashed array values together and store them in the
index in one go, or whether each array value would need to be stored
separately. The index insertion looks somewhat costly, and it does not feel
right to repeat that work for each array element. The same applies to the
set-membership queries, whether for a single value or for many. If I am not
totally mistaken, this depends on the number of different hash functions
used, as each function would yield different bit positions. However, if I
understood the operator class for the current bloom index correctly, we only
supply one hash function?
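To make the question concrete, here is a rough sketch of what I mean by
ORing the hashed array values into a single per-row signature. This is plain
C and not actual contrib/bloom code; sign_t, SIGNATURE_WORDS,
BITS_PER_ELEMENT, element_hash() and the other names are all made up for
illustration:

#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sizing; contrib/bloom makes these configurable. */
#define SIGNATURE_WORDS   8          /* 8 * 64 = 512 signature bits */
#define BITS_PER_ELEMENT  2          /* bits set per hashed element */

typedef uint64_t sign_t[SIGNATURE_WORDS];

/*
 * Hypothetical per-element hash: derive several bit positions from
 * one base hash by re-mixing it with a seed, so a single
 * opclass-provided hash function could still feed several "logical"
 * hash functions.
 */
static uint32_t
element_hash(uint32_t value, int seed)
{
    uint32_t h = value * 0x9e3779b9u + (uint32_t) seed;

    h ^= h >> 16;
    return h;
}

/* Set this element's bits in the row signature. */
static void
add_element(sign_t sig, uint32_t value)
{
    for (int i = 0; i < BITS_PER_ELEMENT; i++)
    {
        uint32_t bit = element_hash(value, i) % (SIGNATURE_WORDS * 64);

        sig[bit / 64] |= UINT64_C(1) << (bit % 64);
    }
}

/* Build one signature for a whole array: OR every element in. */
static void
array_signature(sign_t sig, const uint32_t *elems, size_t nelems)
{
    for (size_t i = 0; i < SIGNATURE_WORDS; i++)
        sig[i] = 0;
    for (size_t i = 0; i < nelems; i++)
        add_element(sig, elems[i]);
}

/* "Might contain all query elements" test, for a containment query. */
static bool
may_contain_all(const sign_t row, const sign_t query)
{
    for (size_t i = 0; i < SIGNATURE_WORDS; i++)
        if ((row[i] & query[i]) != query[i])
            return false;            /* some query bit is missing */
    return true;                     /* maybe; recheck the heap tuple */
}

If something along these lines holds, inserting a row would touch the index
only once no matter how many elements the array has, which is what I was
hoping for.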

My own C skills are getting a bit rusty and I am new to the pg codebase, but
I'll take that as a positive challenge. Any help with the algorithm and/or
implementation would be appreciated (even a small paid contribution would
not be out of the question).

As for the problem I aim to solve: our small startup recently ran into a
performance problem where I believe a bloom index might help. We use arrays
extensively in our activity log, which stores a reference to pretty much
every record that changed during an activity. It is a somewhat heavy
structure, and while we index it using GIN indexes and partition by
timestamp, it already becomes slow to query at >10M rows. Surprisingly, the
GIN index is slower than a full table scan of the partition within the
query's time range. The table structure could be changed (e.g. unrolling the
arrays, resulting in 10-20x the rows), but I'd first like to see whether the
problem can be solved with proper indexes.

I believe a bloom index with support for array intersection queries would be
an ideal solution: the hashed values would be neatly packed into one
signature (with the obvious need to do rechecks, especially when the arrays
have many values).
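For overlap-style queries the story seems different, which is the part I'd
most like feedback on. Reusing the hypothetical helpers from the sketch
above: the query side cannot simply be ORed into one signature, because a
single mask test would demand that *all* query elements match. Each query
element would have to be tested against the row signature separately, with a
heap recheck on any hit:

/* "Might share at least one element" test, for an overlap query.
 * Note the per-element loop: this is why I suspect ORing the query
 * values together is not sufficient for the overlap case. */
static bool
may_overlap(const sign_t row, const uint32_t *query, size_t nquery)
{
    for (size_t q = 0; q < nquery; q++)
    {
        bool all_bits_set = true;

        for (int i = 0; i < BITS_PER_ELEMENT; i++)
        {
            uint32_t bit = element_hash(query[q], i) % (SIGNATURE_WORDS * 64);

            if (!(row[bit / 64] & (UINT64_C(1) << (bit % 64))))
            {
                all_bits_set = false;
                break;
            }
        }
        if (all_bits_set)
            return true;             /* maybe present; recheck tuple */
    }
    return false;                    /* no query element can match */
}

Whether this per-element loop is cheap enough, and how it would map onto the
operator class interface, is exactly what I am unsure about.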

Thanks in advance,
Lauri
