From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Quan Zongliang <quanzongliang(at)yeah(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Incorrect estimation of HashJoin rows resulted from inaccurate small table statistics
Date: 2023-06-16 15:39:14
Message-ID: a06caba6-6821-9d6c-d799-8db1dd0228d2@enterprisedb.com
Lists: pgsql-hackers

On 6/16/23 11:25, Quan Zongliang wrote:
>
> We have a small table with only 23 rows and 21 values.
>
> The resulting MCV and histogram are as follows:
> stanumbers1 | {0.08695652,0.08695652}
> stavalues1  | {v1,v2}
> stavalues2  |
> {v3,v4,v5,v6,v7,v8,v9,v10,v11,v12,v13,v14,v15,v16,v17,v18,v19,v20,v21}
>
> An incorrect number of rows was estimated when this table was hash-joined
> with another, large table (about 2 million rows).
>
> Hash Join  (cost=1.52..92414.61 rows=2035023 width=0) (actual
> time=1.943..1528.983 rows=3902 loops=1)
>

That's interesting. I wonder how the estimate gets this bad simply because
we skip values that appear in only a single row of the sample - for those
we actually know the per-value selectivity pretty well.

My guess is that something strange happens when estimating the join
condition selectivity, where we combine the MCVs from both sides of the
join (that has to be happening here, otherwise it would not matter what
ends up in the MCV).

It'd be interesting to know what's in the other MCV, and what the other
statistics for the attributes look like (ndistinct etc.).
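
For example, something like this against pg_stats for the join columns on
both sides would show it (the table and column names here are just
placeholders, not taken from your report):

  SELECT tablename, attname, n_distinct,
         most_common_vals, most_common_freqs, histogram_bounds
  FROM pg_stats
  WHERE tablename IN ('small_table', 'large_table')  -- placeholder names
    AND attname = 'join_col';                        -- placeholder column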

Or even better, a reproducer SQL script that builds two tables and then
joins them.
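
Something with roughly this shape is what I have in mind - the table
definitions, data distribution and row counts below are only guesses based
on your description, so it likely needs tweaking to reproduce the issue:

  -- small table: 23 rows, 21 distinct values (1 and 2 appear twice)
  CREATE TABLE small_t (v int);
  INSERT INTO small_t SELECT i FROM generate_series(1, 21) i;
  INSERT INTO small_t VALUES (1), (2);

  -- large table: ~2M rows drawn from roughly the same domain of values
  CREATE TABLE big_t (v int);
  INSERT INTO big_t
  SELECT (random() * 20)::int + 1 FROM generate_series(1, 2000000);

  ANALYZE small_t;
  ANALYZE big_t;

  EXPLAIN ANALYZE
  SELECT * FROM big_t JOIN small_t USING (v);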

> The reason is that the small table's MCV excludes values that appear in
> only one row. Putting them into the MCV in the statistics gives the
> correct result.
>
> Using the conservative condition (samplerows <= attstattarget) doesn't
> completely solve this problem, but it does handle this case.
>
> After the modification we get statistics without a histogram:
> stanumbers1 | {0.08695652,0.08695652,0.04347826,0.04347826, ... }
> stavalues1  | {v1,v2, ... }
>
> And we have the right estimates:
> Hash Join  (cost=1.52..72100.69 rows=3631 width=0) (actual
> time=1.447..1268.385 rows=3902 loops=1)
>

I'm not against building a "complete" MCV, but I guess the case where
(samplerows <= num_mcv) is pretty rare. Why shouldn't we make the MCV
complete whenever we decide (ndistinct <= num_mcv)?

That would need to happen later, because we don't have the ndistinct
estimate yet at this point - we'd have to do the loop a bit later (or
likely twice).
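
As a sanity check, it's easy to verify from SQL whether a column ended up
with a complete MCV - in that case the number of MCV entries should match
n_distinct and there should be no histogram at all (names below match the
hypothetical reproducer sketch above):

  SELECT n_distinct,
         array_length(most_common_vals, 1) AS mcv_entries,
         histogram_bounds IS NULL AS no_histogram
  FROM pg_stats
  WHERE tablename = 'small_t' AND attname = 'v';  -- placeholder names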

FWIW the patch breaks the calculation of nmultiple (and thus likely the
ndistinct estimate).

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
