Re: Bloom filter Pushdown Optimization for Merge Join

From: Zhihong Yu <zyu(at)yugabyte(dot)com>
To: Lyu Pan <lyu(dot)steve(dot)pan(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Zheng Li <zhengli10(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Bloom filter Pushdown Optimization for Merge Join
Date: 2022-10-12 23:35:14
Message-ID: CALNJ-vS3KG8vEL7jDR1pDQrgQDUJX5-m9C6_1Tp9Wg25J2nNsQ@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 12, 2022 at 3:36 PM Lyu Pan <lyu(dot)steve(dot)pan(at)gmail(dot)com> wrote:

> Hello Zhihong Yu & Tomas Vondra,
>
> Thank you so much for your review and feedback!
>
> We made some updates based on previous feedback and attached the new
> patch set. Due to time constraints, we didn't get to resolve all the
> comments, and we'll continue to improve this patch.
>
> > In this prototype, the cost model is based on an assumption that there is
> > a linear relationship between the performance gain from using a semijoin
> > filter and the estimated filtering rate:
> > % improvement to Merge Join cost = 0.83 * estimated filtering rate - 0.137.
> >
> > How were the coefficients (0.83 and 0.137) determined ?
> > I guess they were based on the results of running certain workload.
>
> Right, the coefficients (0.83 and 0.137) were determined based on some
> preliminary testing. The current costing model is pretty naive and
> we'll work on a more robust costing model in future work.
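> For example, with an estimated filtering rate of 0.5 the model predicts
> 0.83 * 0.5 - 0.137 = 0.278, i.e. roughly a 27.8% reduction in merge join
> cost, and the break-even point is a filtering rate of about 0.165.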
>
>
> > I agree, in principle, although I think the current logic / formula is a
> > bit too crude and fitted to the simple data used in the test. I think
> > this needs to be formulated as a regular costing issue, considering
> > stuff like cost of the hash functions, and so on.
> >
> > I think this needs to do two things:
> >
> > 1) estimate the cost of building the bloom filter - This shall depend on
> > the number of rows in the inner relation, number/cost of the hash
> > functions (which may be higher for some data types), etc.
> >
> > 2) estimate improvement for the probing branch - Essentially, we need to
> > estimate how much we save by filtering some of the rows, but this also
> > needs to include the cost of probing the bloom filter.
> >
> > This will probably require some improvements to the lib/bloomfilter, in
> > order to estimate the false positive rate - this may matter a lot for
> > large data sets and small work_mem values. The bloomfilter library
> > simply reduces the size of the bloom filter, which increases the false
> > positive rate. At some point it'll start reducing the benefit.
> >
>
> These suggestions make a lot of sense. The current costing model is
> definitely not good enough, and we plan to work on a more robust
> costing model as we continue to improve the patch.
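>
> To make the direction more concrete, below is a rough sketch of what such a
> two-part costing could look like. It is only an illustration: the function
> name, parameters, and the way savings are credited are placeholders, not
> actual patch code; only cpu_operator_cost and cpu_tuple_cost are existing
> planner parameters.
>
> /* Hypothetical sketch only -- names and formula are placeholders. */
> static void
> cost_semijoin_filter(double inner_rows, double outer_rows,
>                      double filtering_rate, double fpr,
>                      Cost *build_cost, Cost *probe_savings)
> {
>     double  rows_skipped;
>
>     /* 1) build side: hash every inner row into the bloom filter */
>     *build_cost = inner_rows * cpu_operator_cost;
>
>     /* 2) probe side: every outer row pays one probe, but rows the
>      *    filter rejects (net of false positives) skip later work */
>     rows_skipped = outer_rows * filtering_rate * (1.0 - fpr);
>     *probe_savings = rows_skipped * cpu_tuple_cost -
>                      outer_rows * cpu_operator_cost;
> }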
>
>
> > OK. Could also build the bloom filter in shared memory?
> >
>
> We thought about this approach but didn't prefer this one because if
> all worker processes share the same bloom filter in shared memory, we
> need to frequently lock and unlock the bloom filter to avoid race
> conditions. So we decided to have each worker process create its own
> bloom filter.
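>
> For reference, the existing lib/bloomfilter API already makes the per-worker
> approach simple; a minimal sketch (the sizing, seed, and key handling here
> are illustrative only, not the patch code) looks roughly like:
>
> #include "lib/bloomfilter.h"
>
> /* each worker builds its own private filter, so no locking is needed */
> bloom_filter *filter = bloom_create(est_inner_rows,  /* expected elements */
>                                     work_mem,        /* size budget in KB */
>                                     0);              /* hash seed */
>
> /* build side: called once per inner-side join key */
> bloom_add_element(filter, (unsigned char *) &keyval, sizeof(keyval));
>
> /* probe side (inside the per-outer-row loop): rows the filter
>  * rejects cannot possibly have an inner match */
> if (bloom_lacks_element(filter, (unsigned char *) &keyval, sizeof(keyval)))
>     continue;               /* skip this outer row */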
>
>
> > IMHO we shouldn't draw too many conclusions from these examples. Yes, they
> > show merge join can be improved, but only for cases where a hashjoin works
> > better, so we wouldn't use merge join anyway.
> >
> > I think we should try constructing examples where either merge join wins
> > already (and gets further improved by the bloom filter), or would lose
> > to hash join and the bloom filter improves it enough to win.
> >
> > AFAICS that requires a join of two large tables - large enough that hash
> > join would need to be batched, or pre-sorted inputs (which eliminates
> > the explicit Sort, which is the main cost in most cases).
> >
> > The current patch only works with sequential scans, which eliminates the
> > second (pre-sorted) option. So let's try the first one - can we invent
> > an example with a join of two large tables where a merge join would win?
> >
> > Can we find such an example in existing benchmarks like TPC-H/TPC-DS?
> >
>
> Agreed. The current examples are only intended to show that using
> bloom filters in merge join could improve merge join performance
> in some cases. We are working on more examples where a merge join
> with a bloom filter outperforms a hash join, which should be more
> persuasive.
>
>
> > The bloom filter is built by the first seqscan (on t0), and then used by
> > the second seqscan (on t1). But this only works because we always run
> > the t0 scan to completion (because we're feeding it into Sort) before we
> > start scanning t1.
> >
> > But when the scan on t1 switches to an index scan, it's over - we'd be
> > building the filter without being able to probe it, and when we finish
> > building it we no longer need it. So this seems pretty futile.
> >
> > It might still improve plans like
> >
> >    ->  Merge Join
> >          Merge Cond: (t0.c1 = t1.c1)
> >          SemiJoin Filter Created Based on: (t0.c1 = t1.c1)
> >          SemiJoin Estimated Filtering Rate: 1.0000
> >          ->  Sort
> >                Sort Key: t0.c1
> >                ->  Seq Scan on t0
> >          ->  Index Scan on t1
> >
> > But I don't know how common/likely that actually is. I'd expect to have
> > an index on both sides, but perhaps I'm wrong.
> >
> > This is why hashjoin seems like a more natural fit for the bloom filter,
> > BTW, because there we have a guarantee the inner relation is processed
> > first (so we know the bloom filter is fine and can be probed).
> >
>
> Great observation. The bloom filter only works if the first SeqScan
> always runs to completion before the second SeqScan starts.
> One possible way to avoid building a futile bloom filter might be to
> only use it when both the outer and inner plans of the MergeJoin are
> Sort nodes, which guarantees the bloom filter is fully built after one
> side of the join has been processed, but this may be too restrictive.
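> A minimal sketch of that restriction (hypothetical, not in the current
> patch) could simply check the shape of the finished plan:
>
> /* Hypothetical: only keep the semijoin filter when both merge join
>  * inputs are explicit Sorts, so the build side is guaranteed to be
>  * consumed to completion before the probe side starts. */
> if (IsA(outerPlan(mergeplan), Sort) &&
>     IsA(innerPlan(mergeplan), Sort))
>     allow_semijoin_filter = true;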
>
>
> > I don't know what improvements you have in mind exactly, but I think
> > it'd be good to show which node is building/using a bloom filter, and
> > then also some basic stats (size, number of hash functions, FPR, number
> > of probes, ...). This may require improvements to lib/bloomfilter, which
> > currently does not expose some of the details.
> >
>
> Along with the new patch set, we have added information showing
> which node is building/using a bloom filter (as well as the
> corresponding expressions), plus some basic bloom filter stats. We'll
> add more related information (e.g. FPR) as we modify the lib/bloomfilter
> implementation in future work.
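> For the FPR part, one option is to derive an estimate from the proportion
> of bits set once the filter is built. This is only a sketch:
> bloom_prop_bits_set() exists today, but lib/bloomfilter does not expose
> the number of hash functions, so bloom_num_hash_funcs() below is a
> hypothetical accessor we would have to add.
>
> /* If p is the fraction of bits set and k the number of hash functions,
>  * a lookup of a non-member is a false positive with probability ~p^k. */
> double  p = bloom_prop_bits_set(filter);     /* existing API */
> int     k = bloom_num_hash_funcs(filter);    /* hypothetical accessor */
> double  est_fpr = pow(p, k);                 /* needs <math.h> */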
>
>
> Thanks again for the great reviews; they are really useful! More
> feedback is always welcome and appreciated!
>
> Regards,
> Lyu Pan
> Amazon RDS/Aurora for PostgreSQL
>

Hi,
For v1-0001-Support-semijoin-filter-in-the-planner-optimizer.patch :

+
+   /* want at least 1000 rows_filtered to avoid any nasty edge cases */
+   if (force_mergejoin_semijoin_filter ||
+       (filtering_rate >= 0.35 && rows_filtered > 1000 &&
+        best_filter_clause >= 0))

Currently rows_filtered is compared with a constant; should the constant be
made configurable ?
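
If it were made configurable, a GUC along these lines might work (the name,
group, and default here are just a suggestion, not part of the patch):

    /* hypothetical entry for the ConfigureNamesInt[] GUC table */
    {
        {"semijoin_filter_min_rows", PGC_USERSET, QUERY_TUNING_OTHER,
            gettext_noop("Sets the minimum estimated number of filtered rows "
                         "required to consider a semijoin filter."),
            NULL
        },
        &semijoin_filter_min_rows,
        1000, 0, INT_MAX,
        NULL, NULL, NULL
    },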

+ * improvement of 19.5%. This equation also concludes thata a 17%

Typo: thata

+ inner_md_array = palloc(sizeof(SemiJoinFilterExprMetadata) * num_md);
+ if (!outer_md_array || !inner_md_array)
+ {
+ return 0; /* a stack array allocation failed */

Should the allocated array be freed before returning ?

For verify_valid_pushdown(), the parameters in the comment don't match the
actual parameters. Please update the comment.

For is_fk_pk(), since the outer_key_list is fixed across the iterations, I
think all_vars can be computed outside of expressions_match_foreign_key().
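
Something along these lines might work (sketch only: the names come from the
review above, and the extra all_vars parameter to
expressions_match_foreign_key() is a guess at the refactoring, not actual
patch code):

    /* Collect the Vars from outer_key_list once, before the per-FK loop,
     * instead of recomputing them inside expressions_match_foreign_key(). */
    List       *all_vars = NIL;
    ListCell   *lc;

    foreach(lc, outer_key_list)
        all_vars = list_concat(all_vars,
                               pull_var_clause((Node *) lfirst(lc), 0));

    foreach(lc, root->fkey_list)
    {
        ForeignKeyOptInfo *fkey = (ForeignKeyOptInfo *) lfirst(lc);

        if (expressions_match_foreign_key(fkey, outer_key_list, all_vars))
            break;          /* found a matching foreign key */
    }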

Cheers
