Re: Bloom filter Pushdown Optimization for Merge Join

From: Zhihong Yu <zyu(at)yugabyte(dot)com>
To: Lyu Pan <lyu(dot)steve(dot)pan(at)gmail(dot)com>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Zheng Li <zhengli10(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Bloom filter Pushdown Optimization for Merge Join
Date: 2022-10-13 16:33:01
Message-ID: CALNJ-vRMdO5N0RmWNxQmKPRHY9dALcpNF8C-gGTyoHNUCOYgrw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Oct 13, 2022 at 7:30 AM Zhihong Yu <zyu(at)yugabyte(dot)com> wrote:

>
>
> On Wed, Oct 12, 2022 at 4:35 PM Zhihong Yu <zyu(at)yugabyte(dot)com> wrote:
>
>>
>>
>> On Wed, Oct 12, 2022 at 3:36 PM Lyu Pan <lyu(dot)steve(dot)pan(at)gmail(dot)com> wrote:
>>
>>> Hello Zhihong Yu & Tomas Vondra,
>>>
>>> Thank you so much for your review and feedback!
>>>
>>> We made some updates based on previous feedback and attached the new
>>> patch set. Due to time constraints, we didn't get to resolve all the
>>> comments, and we'll continue to improve this patch.
>>>
>>> > In this prototype, the cost model is based on an assumption that there
>>> is
>>> > a linear relationship between the performance gain from using a
>>> semijoin
>>> > filter and the estimated filtering rate:
>>> > % improvement to Merge Join cost = 0.83 * estimated filtering rate -
>>> 0.137.
>>> >
>>> > How were the coefficients (0.83 and 0.137) determined ?
>>> > I guess they were based on the results of running certain workload.
>>>
>>> Right, the coefficients (0.83 and 0.137) determined are based on some
>>> preliminary testings. The current costing model is pretty naive and
>>> we'll work on a more robust costing model in future work.
>>>
>>>
>>> > I agree, in principle, although I think the current logic / formula is
>>> a
>>> > bit too crude and fitted to the simple data used in the test. I think
>>> > this needs to be formulated as a regular costing issue, considering
>>> > stuff like cost of the hash functions, and so on.
>>> >
>>> > I think this needs to do two things:
>>> >
>>> > 1) estimate the cost of building the bloom filter - This shall depend
>>> on
>>> > the number of rows in the inner relation, number/cost of the hash
>>> > functions (which may be higher for some data types), etc.
>>> >
>>> > 2) estimate improvement for the probing branch - Essentially, we need
>>> to
>>> > estimate how much we save by filtering some of the rows, but this also
>>> > neeeds to include the cost of probing the bloom filter.
>>> >
>>> > This will probably require some improvements to the lib/bloomfilter, in
>>> > order to estimate the false positive rate - this may matter a lot for
>>> > large data sets and small work_mem values. The bloomfilter library
>>> > simply reduces the size of the bloom filter, which increases the false
>>> > positive rate. At some point it'll start reducing the benefit.
>>> >
>>>
>>> These suggestions make a lot of sense. The current costing model is
>>> definitely not good enough, and we plan to work on a more robust
>>> costing model as we continue to improve the patch.
>>>
>>>
>>> > OK. Could also build the bloom filter in shared memory?
>>> >
>>>
>>> We thought about this approach but didn't prefer this one because if
>>> all worker processes share the same bloom filter in shared memory, we
>>> need to frequently lock and unlock the bloom filter to avoid race
>>> conditions. So we decided to have each worker process create its own
>>> bloom filter.
>>>
>>>
>>> > IMHO we shouldn't make too many conclusions from these examples. Yes,
>>> it
>>> > shows merge join can be improved, but for cases where a hashjoin works
>>> > better so we wouldn't use merge join anyway.
>>> >
>>> > I think we should try constructing examples where either merge join
>>> wins
>>> > already (and gets further improved by the bloom filter), or would lose
>>> > to hash join and the bloom filter improves it enough to win.
>>> >
>>> > AFAICS that requires a join of two large tables - large enough that
>>> hash
>>> > join would need to be batched, or pre-sorted inputs (which eliminates
>>> > the explicit Sort, which is the main cost in most cases).
>>> >
>>> > The current patch only works with sequential scans, which eliminates
>>> the
>>> > second (pre-sorted) option. So let's try the first one - can we invent
>>> > an example with a join of two large tables where a merge join would
>>> win?
>>> >
>>> > Can we find such example in existing benchmarks like TPC-H/TPC-DS.
>>> >
>>>
>>> Agreed. The current examples are only intended to show us that using
>>> bloom filters in merge join could improve the merge join performance
>>> in some cases. We are working on testing more examples that merge join
>>> with bloom filter could out-perform hash join, which should be more
>>> persuasive.
>>>
>>>
>>> > The bloom filter is built by the first seqscan (on t0), and then used
>>> by
>>> > the second seqscan (on t1). But this only works because we always run
>>> > the t0 scan to completion (because we're feeding it into Sort) before
>>> we
>>> > start scanning t1.
>>> >
>>> > But when the scan on t1 switches to an index scan, it's over - we'd be
>>> > building the filter without being able to probe it, and when we finish
>>> > building it we no longer need it. So this seems pretty futile.
>>> >
>>> > It might still improve plans like
>>> >
>>> > -> Merge Join
>>> > Merge Cond: (t0.c1 = t1.c1)
>>> > SemiJoin Filter Created Based on: (t0.c1 = t1.c1)
>>> > SemiJoin Estimated Filtering Rate: 1.0000
>>> > -> Sort
>>> > Sort Key: t0.c1
>>> > -> Seq Scan on t0
>>> > -> Index Scan on t1
>>> >
>>> > But I don't know how common/likely that actually is. I'd expect to have
>>> > an index on both sides, but perhaps I'm wrong.
>>> >
>>> > This is why hashjoin seems like a more natural fit for the bloom
>>> filter,
>>> > BTW, because there we have a guarantee the inner relation is processed
>>> > first (so we know the bloom filter is fine and can be probed).
>>> >
>>>
>>> Great observation. The bloom filter only works if the first SeqScan
>>> always runs to completion before the second SeqScan starts.
>>> I guess one possible way to avoid futile bloom filter might be
>>> enforcing that the bloom filter only be used if both the outer/inner
>>> plans of the MergeJoin are Sort nodes, to guarantee the bloom filter
>>> is ready to use after processing one side of the join, but this may be
>>> too restrictive.
>>>
>>>
>>> > I don't know what improvements you have in mind exactly, but I think
>>> > it'd be good to show which node is building/using a bloom filter, and
>>> > then also some basic stats (size, number of hash functions, FPR, number
>>> > of probes, ...). This may require improvements to lib/bloomfilter,
>>> which
>>> > currently does not expose some of the details.
>>> >
>>>
>>> Along with the new patch set, we have added information to display
>>> which node is building/using a bloom filter (as well as the
>>> corresponding expressions), and some bloom filter basic stats. We'll
>>> add more related information (e.g. FPR) as we modify lib/bloomfilter
>>> implementation in future work.
>>>
>>>
>>> Thanks again for the great reviews and they are really useful! More
>>> feedback is always welcome and appreciated!
>>>
>>> Regards,
>>> Lyu Pan
>>> Amazon RDS/Aurora for PostgreSQL
>>>
>>> Hi,
>> For v1-0001-Support-semijoin-filter-in-the-planner-optimizer.patch :
>>
>> +
>> + /* want at least 1000 rows_filtered to avoid any nasty edge cases
>> */
>> + if (force_mergejoin_semijoin_filter ||
>> + (filtering_rate >= 0.35 && rows_filtered > 1000 &&
>> best_filter_clause >= 0))
>>
>> Currently rows_filtered is compared with a constant, should the constant
>> be made configurable ?
>>
>> + * improvement of 19.5%. This equation also concludes thata a
>> 17%
>>
>> Typo: thata
>>
>> + inner_md_array = palloc(sizeof(SemiJoinFilterExprMetadata) * num_md);
>> + if (!outer_md_array || !inner_md_array)
>> + {
>> + return 0; /* a stack array allocation failed */
>>
>> Should the allocated array be freed before returning ?
>>
>> For verify_valid_pushdown(), the parameters in comment don't match the
>> actual parameters. Please modify the comment.
>>
>> For is_fk_pk(), since the outer_key_list is fixed across the iterations,
>> I think all_vars can be computed outside of expressions_match_foreign_key().
>>
>> Cheers
>>
>
> Continuing
> with v1-0001-Support-semijoin-filter-in-the-planner-optimizer.patch
>
> For get_switched_clauses(), the returned List contains all the clauses.
> Yet the name suggests that only switched clauses are returned.
> Please rename the method to adjust_XX or rearrange_YY so that it is less
> likely to cause confusion.
>
> For depth_of_semijoin_target(), idx is only used inside the `if (num_exprs
> == get_appendrel_occluded_references(` block, you can move its declaration
> inside that if block.
>
> + outside_subq_rte = root->simple_rte_array[outside_subq_var->varno];
> +
> + /* System Vars have varattno < 0, don't bother */
> + if (outside_subq_var->varattno <= 0)
> + return 0;
>
> Since the check on outside_subq_var->varattno doesn't depend on
> outside_subq_rte, you can move the assignment to outside_subq_rte after the
> if check.
>
> Cheers
>

For v1-0002-Support-semijoin-filter-in-the-executor-for-non-para.patch, I
have a few comments.

+
+ if (IsA(node, SeqScanState) && ((SeqScanState *)
node)->apply_semijoin_filter
+ && !ExecScanUsingSemiJoinFilter((SeqScanState *) node,
econtext))
+ {
+ /* slot did not pass SemiJoinFilter, so skipping it. */
+ ResetExprContext(econtext);
+ continue;
+ }
+ return projectedSlot;

Since projectedSlot is only returned if slot passes SemiJoinFilter, you can
move the call to `ExecProject` after the above if block.

+ List *semijoin_filters; /* SemiJoinFilterJoinNodeState */
+ List *sj_scan_data; /* SemiJoinFilterScanNodeState */

Better use plural form in the comments.

Cheers

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2022-10-13 16:39:41 Re: [RFC] building postgres with meson - v13
Previous Message Adrian Klaver 2022-10-13 16:20:51 Exponentiation confusion