Quick Links

Re: MergeJoin beats HashJoin in the case of multiple hash clauses

From:	Andrei Lepikhov <lepihov(at)gmail(dot)com>
To:	Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>
Cc:	PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "a(dot)rybakina" <a(dot)rybakina(at)postgrespro(dot)ru>
Subject:	Re: MergeJoin beats HashJoin in the case of multiple hash clauses
Date:	2024-07-08 12:45:15
Message-ID:	c99dc067-9fec-4649-a0f5-62a771404972@gmail.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 3/11/2023 23:43, Tomas Vondra wrote:
> On 9/11/23 10:04, Lepikhov Andrei wrote:
> * Determine bucketsize fraction and MCV frequency for the inner
> * relation. We use the smallest bucketsize or MCV frequency estimated
> * for any individual hashclause; this is undoubtedly conservative.
>
> I'm sure this may lead to inflated cost for "good" cases (where the
> actual bucket size really is a product), which may push the optimizer to
> use the less efficient/slower join method.
Yes, It was contradictory idea, though.
> IMHO the only principled way forward is to get a better ndistinct
> estimate (which this implicitly does), perhaps by using extended
> statistics. I haven't tried, but I guess it'd need to extract the
> clauses for the inner side, and call estimate_num_groups() on it.
And I've done it. Sorry for so long response. This patch employs of
extended statistics for estimation of the HashJoin bucket_size. In
addition, I describe the idea in more convenient form here [1].
Obviously, it needs the only ndistinct to make a prediction that allows
to reduce computational cost of this statistic.

[1]
https://open.substack.com/pub/danolivo/p/why-postgresql-prefers-mergejoin?r=34q1yy&utm_campaign=post&utm_medium=web

--
regards, Andrei Lepikhov

Attachment	Content-Type	Size
0001-Use-extended-statistics-for-precise-estimation-of-bu.patch	text/plain	8.2 KB

In response to

Re: MergeJoin beats HashJoin in the case of multiple hash clauses at 2023-11-03 16:43:26 from Tomas Vondra

Responses

Re: MergeJoin beats HashJoin in the case of multiple hash clauses at 2024-10-08 05:00:27 from Andrei Lepikhov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Amit Langote	2024-07-08 12:57:00	Re: Doc Rework: Section 9.16.13 SQL/JSON Query Functions
Previous Message	Tom Lane	2024-07-08 12:44:27	Re: Is it expected behavior index only scan shows "OPERATOR(pg_catalog." for EXPLAIN?