Quick Links

Re: MergeJoin beats HashJoin in the case of multiple hash clauses

From:	"Lepikhov Andrei" <a(dot)lepikhov(at)postgrespro(dot)ru>
To:	"Andy Fan" <zhihui(dot)fan1213(at)gmail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject:	Re: MergeJoin beats HashJoin in the case of multiple hash clauses
Date:	2023-09-11 08:04:22
Message-ID:	84114cba-97e9-49f8-952f-04cf321125c1@app.fastmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Mon, Sep 11, 2023, at 11:51 AM, Andy Fan wrote:
> Hi,
>
> On Thu, Jun 15, 2023 at 4:30 PM Andrey Lepikhov
> <a(dot)lepikhov(at)postgrespro(dot)ru> wrote:
>> Hi, all.
>>
>> Some of my clients use JOIN's with three - four clauses. Quite
>> frequently, I see complaints on unreasonable switch of JOIN algorithm to
>> Merge Join instead of Hash Join. Quick research have shown one weak
>> place - estimation of an average bucket size in final_cost_hashjoin (see
>> q2.sql in attachment) with very conservative strategy.
>> Unlike estimation of groups, here we use smallest ndistinct value across
>> all buckets instead of multiplying them (or trying to make multivariate
>> analysis).
>> It works fine for the case of one clause. But if we have many clauses,
>> and if each has high value of ndistinct, we will overestimate average
>> size of a bucket and, as a result, prefer to use Merge Join. As the
>> example in attachment shows, it leads to worse plan than possible,
>> sometimes drastically worse.
>> I assume, this is done with fear of functional dependencies between hash
>> clause components. But as for me, here we should go the same way, as
>> estimation of groups.
>
> I can reproduce the visitation you want to improve and verify the patch
> can do it expectedly. I think this is a right thing to do.
>
>> The attached patch shows a sketch of the solution.
>
> I understand that this is a sketch of the solution, but the below
> changes still
> make me confused.
>
> + if (innerbucketsize > virtualbuckets)
> + innerbucketsize = 1.0 / virtualbuckets;
>
> innerbucketsize is a fraction of rows in all the rows, so it is between
> 0.0 and 1.0.
> and virtualbuckets is the number of buckets in total (when considered
> the mutli
> batchs), how is it possible for 'innerbucketsize > virtualbuckets' ?
> Am
> I missing something?

You are right here. I've made a mistake here. Changed diff is in attachment.

--
Regards,
Andrei Lepikhov

Attachment	Content-Type	Size
fix_bucketsize_v2.diff	application/octet-stream	680 bytes

In response to

Re: MergeJoin beats HashJoin in the case of multiple hash clauses at 2023-09-11 04:51:03 from Andy Fan

Responses

Re: MergeJoin beats HashJoin in the case of multiple hash clauses at 2023-11-03 16:43:26 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Kuwamura Masaki	2023-09-11 08:49:46	Re: pg_rewind with cascade standby doesn't work well
Previous Message	Michael Paquier	2023-09-11 07:24:58	Re: pg_logical_emit_message() misses a XLogFlush()