Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics

From: Quan Zongliang <quanzongliang(at)yeah(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics
Date: 2023-06-17 00:02:54
Message-ID: 4660403b-8df9-2b44-2214-26b21f35b539@yeah.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2023/6/17 06:46, Tom Lane wrote:
> Quan Zongliang <quanzongliang(at)yeah(dot)net> writes:
>> Perhaps we should discard this (dups cnt > 1) restriction?
>
> That's not going to happen on the basis of one test case that you
> haven't even shown us. The implications of doing it are very unclear.
> In particular, I seem to recall that there are bits of logic that
> depend on the assumption that MCV entries always represent more than
> one row. The nmultiple calculation Tomas referred to may be failing
> because of that, but I'm worried about there being other places.
>

The statistics for the other table look like this:
stadistinct | 6
stanumbers1 | {0.50096667,0.49736667,0.0012}
stavalues1 | {v22,v23,v5}

The value that appears twice in the small table (v1 and v2) does not
appear here. The stadistinct's true value is 18 instead of 6 (three
values in the small table do not appear here).

When calculating the selectivity:
if (nd2 > sslot2->nvalues)
totalsel1 += unmatchfreq1 * otherfreq2 / (nd2 - sslot2->nvalues);

totalsel1 = 0
nd2 = 21
sslot2->nvalues = 2
unmatchfreq1 = 0.99990002016420476
otherfreq2 = 0.82608695328235626

result: totalsel1 = 0.043473913749706022
rows = 0.043473913749706022 * 23 * 2,000,000 = 1999800

> Basically, you're proposing a rather fundamental change in the rules
> by which Postgres has gathered statistics for decades. You need to
> bring some pretty substantial evidence to support that. The burden
> of proof is on you, not on the status quo.
>
> regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Joel Jacobson 2023-06-17 00:38:23 Re: Do we want a hashset type?
Previous Message Tom Lane 2023-06-16 22:46:37 Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics