Re: MergeJoin beats HashJoin in the case of multiple hash clauses

From: Andrei Lepikhov <lepihov(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "a(dot)rybakina" <a(dot)rybakina(at)postgrespro(dot)ru>
Subject: Re: MergeJoin beats HashJoin in the case of multiple hash clauses
Date: 2024-07-08 12:45:15
Message-ID: c99dc067-9fec-4649-a0f5-62a771404972@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3/11/2023 23:43, Tomas Vondra wrote:
> On 9/11/23 10:04, Lepikhov Andrei wrote:
> * Determine bucketsize fraction and MCV frequency for the inner
> * relation. We use the smallest bucketsize or MCV frequency estimated
> * for any individual hashclause; this is undoubtedly conservative.
>
> I'm sure this may lead to inflated cost for "good" cases (where the
> actual bucket size really is a product), which may push the optimizer to
> use the less efficient/slower join method.
Yes, It was contradictory idea, though.
> IMHO the only principled way forward is to get a better ndistinct
> estimate (which this implicitly does), perhaps by using extended
> statistics. I haven't tried, but I guess it'd need to extract the
> clauses for the inner side, and call estimate_num_groups() on it.
And I've done it. Sorry for so long response. This patch employs of
extended statistics for estimation of the HashJoin bucket_size. In
addition, I describe the idea in more convenient form here [1].
Obviously, it needs the only ndistinct to make a prediction that allows
to reduce computational cost of this statistic.

[1]
https://open.substack.com/pub/danolivo/p/why-postgresql-prefers-mergejoin?r=34q1yy&utm_campaign=post&utm_medium=web

--
regards, Andrei Lepikhov

Attachment Content-Type Size
0001-Use-extended-statistics-for-precise-estimation-of-bu.patch text/plain 8.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2024-07-08 12:57:00 Re: Doc Rework: Section 9.16.13 SQL/JSON Query Functions
Previous Message Tom Lane 2024-07-08 12:44:27 Re: Is it expected behavior index only scan shows "OPERATOR(pg_catalog." for EXPLAIN?