From: | Andrei Lepikhov <lepihov(at)gmail(dot)com> |
---|---|
To: | Alexander Korotkov <aekorotkov(at)gmail(dot)com> |
Cc: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, Andy Fan <zhihui(dot)fan1213(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "a(dot)rybakina" <a(dot)rybakina(at)postgrespro(dot)ru> |
Subject: | Re: MergeJoin beats HashJoin in the case of multiple hash clauses |
Date: | 2025-03-03 08:24:40 |
Message-ID: | 8750fa3f-43b6-40db-803f-d6ae471384ef@gmail.com |
Lists: | pgsql-hackers |
On 17/2/2025 01:34, Alexander Korotkov wrote:
> Hi, Andrei!
>
> On Tue, Oct 8, 2024 at 8:00 AM Andrei Lepikhov <lepihov(at)gmail(dot)com> wrote:
> Thank you for your work on this subject. I agree with the general
> direction. While everyone has used conservative estimates for a long
> time, it's better to change them only when we're sure about it.
> However, I'm still not sure I get the conservatism.
>
> if (innerbucketsize > thisbucketsize)
> innerbucketsize = thisbucketsize;
> if (innermcvfreq > thismcvfreq)
> innermcvfreq = thismcvfreq;
>
> AFAICS, even in the worst case (all columns are totally correlated),
> the overall bucket size should be the smallest bucket size among
> clauses (not the largest). And the same is true of MCV. As a mental
> experiment, we can add a new clause to hash join, which is always true
> because columns on both sides have the same value. In fact, it would
> have almost no influence except for the cost of extracting additional
> columns and the cost of executing additional operators. But in the
> current model, this additional clause would completely ruin
> thisbucketsize and thismcvfreq, making hash join extremely
> unappealing. Should we still revise this to calculate minimum instead
> of maximum?
I agree with your point. But I think the code works precisely the way
you have described.
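
To illustrate, here is a minimal standalone sketch of that accumulation. It is not the planner code: the ClauseEstimate struct and the combine_clause_estimates() function are hypothetical names, and the starting value of 1.0 for both accumulators is an assumption made for this example. It only shows that the two comparisons quoted above keep the smallest value seen across clauses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-clause estimates; this is not a PostgreSQL struct. */
typedef struct ClauseEstimate
{
	double		bucketsize;		/* estimated fraction of inner rows per bucket */
	double		mcvfreq;		/* frequency of the most common value */
} ClauseEstimate;

/*
 * Combine per-clause estimates with the same comparisons as the quoted
 * snippet.  Starting both accumulators at 1.0 is an assumption of this
 * sketch; the result is the minimum over all clauses.
 */
ClauseEstimate
combine_clause_estimates(const ClauseEstimate *clauses, size_t nclauses)
{
	ClauseEstimate result = {1.0, 1.0};

	assert(clauses != NULL || nclauses == 0);

	for (size_t i = 0; i < nclauses; i++)
	{
		if (result.bucketsize > clauses[i].bucketsize)
			result.bucketsize = clauses[i].bucketsize;
		if (result.mcvfreq > clauses[i].mcvfreq)
			result.mcvfreq = clauses[i].mcvfreq;
	}

	return result;
}
```

With one selective clause (bucketsize 0.01, mcvfreq 0.05) and one always-true clause (both 1.0), the combined estimate stays at 0.01 and 0.05, so the extra clause does not make the hash join look worse.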
>
> I've slightly revised the patch. I've run pg_indent and renamed
> s/saveList/origin_rinfos/g for better readability.
Thank you!
>
> Also, the patch badly needs regression test coverage. We can't
> include costs in expected outputs. But that could be some plans,
> which previously were reliably merge joins but now become reliable
> hash joins.
I added one test here. Writing more tests for this feature is hard, but
feature [1] may provide us with additional tools to reveal extended
statistics internals. I have also thought about injection points, but that
seems like an over-complication.
[1] Showing applied extended statistics in explain Part 2
https://www.postgresql.org/message-id/flat/TYYPR01MB82310B308BA8770838F681619E5E2%40TYYPR01MB8231.jpnprd01.prod.outlook.com
--
regards, Andrei Lepikhov
Attachment | Content-Type | Size |
---|---|---|
v3-0001-Use-extended-statistics-for-precise-estimation-of.patch | text/plain | 12.2 KB |