On 3/11/2023 23:43, Tomas Vondra wrote:
> On 9/11/23 10:04, Lepikhov Andrei wrote:
> * Determine bucketsize fraction and MCV frequency for the inner
> * relation. We use the smallest bucketsize or MCV frequency estimated
> * for any individual hashclause; this is undoubtedly conservative.
>
> I'm sure this may lead to inflated cost for "good" cases (where the
> actual bucket size really is a product), which may push the optimizer to
> use the less efficient/slower join method.
Yes, It was contradictory idea, though.
> IMHO the only principled way forward is to get a better ndistinct
> estimate (which this implicitly does), perhaps by using extended
> statistics. I haven't tried, but I guess it'd need to extract the
> clauses for the inner side, and call estimate_num_groups() on it.
And I've done it. Sorry for so long response. This patch employs of
extended statistics for estimation of the HashJoin bucket_size. In
addition, I describe the idea in more convenient form here [1].
Obviously, it needs the only ndistinct to make a prediction that allows
to reduce computational cost of this statistic.
[1]
https://open.substack.com/pub/danolivo/p/why-postgresql-prefers-mergejoin?r=34q1yy&utm_campaign=post&utm_medium=web
--
regards, Andrei Lepikhov