Re: MergeJoin beats HashJoin in the case of multiple hash clauses

From: Andrei Lepikhov <lepihov(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "a(dot)rybakina" <a(dot)rybakina(at)postgrespro(dot)ru>
Subject: Re: MergeJoin beats HashJoin in the case of multiple hash clauses
Date: 2024-10-08 05:00:27
Message-ID: 6b5c342d-45b8-4e01-8ff1-3032b80564ee@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 7/8/24 19:45, Andrei Lepikhov wrote:
> On 3/11/2023 23:43, Tomas Vondra wrote:
>> On 9/11/23 10:04, Lepikhov Andrei wrote:
>>   * Determine bucketsize fraction and MCV frequency for the inner
>>   * relation. We use the smallest bucketsize or MCV frequency estimated
>>   * for any individual hashclause; this is undoubtedly conservative.
>>
>> I'm sure this may lead to inflated cost for "good" cases (where the
>> actual bucket size really is a product), which may push the optimizer to
>> use the less efficient/slower join method.
> Yes, It was contradictory idea, though.
>> IMHO the only principled way forward is to get a better ndistinct
>> estimate (which this implicitly does), perhaps by using extended
>> statistics. I haven't tried, but I guess it'd need to extract the
>> clauses for the inner side, and call estimate_num_groups() on it.
> And I've done it. Sorry for so long response. This patch employs of
> extended statistics for estimation of the HashJoin bucket_size. In
> addition, I describe the idea in more convenient form here [1].
> Obviously, it needs the only ndistinct to make a prediction that allows
> to reduce computational cost of this statistic.
Minor change to make cfbot happier.

--
regards, Andrei Lepikhov

Attachment Content-Type Size
v1-0001-Use-extended-statistics-for-precise-estimation-of.patch text/x-patch 8.2 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message jian he 2024-10-08 05:11:00 Re: POC, WIP: OR-clause support for indexes
Previous Message Michael Paquier 2024-10-08 04:46:23 Re: per backend I/O statistics