Re: BUG #16759: Estimation of the planner is wrong for hash join

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Bertrand Guillaumin <bertrand(dot)guillaumin(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #16759: Estimation of the planner is wrong for hash join
Date: 2020-12-18 01:14:54
Message-ID: 18e3e4e6-2816-77fa-84c2-b85448fbe6ce@enterprisedb.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On 12/17/20 6:36 PM, Bertrand Guillaumin wrote:
> Hello,
> I think I just made a test that shows that even with MCV on both sides
> the estimated selectivity can be pretty wrong.
> test=# create table test_bug_hash2 as SELECT mod(id,500) as id , case
> when id<=500 then 1 else 2 end parent_id, null::text as attrib from
> (select generate_series(1,1000) as id) alias0;
> SELECT 1000
> test=# update test_bug_hash2 set attrib='BEL' where id=2;
> UPDATE 2
> test=# analyze test_bug_hash2;
> ANALYZE
> test=# explain select * from test_bug_hash2 a, test_bug_hash2 b where
> a.parent_id=b.id <http://b.id> and b.attrib='BEL';
>                                   QUERY PLAN
> ------------------------------------------------------------------------------
>  Hash Join  (cost=17.52..37.56 rows=4 width=24)
>    Hash Cond: (a.parent_id = b.id <http://b.id>)
>    ->  Seq Scan on test_bug_hash2 a  (cost=0.00..15.00 rows=1000 width=12)
>    ->  Hash  (cost=17.50..17.50 rows=2 width=12)
>          ->  Seq Scan on test_bug_hash2 b  (cost=0.00..17.50 rows=2
> width=12)
>                Filter: (attrib = 'BEL'::text)
>
> test=#  select count(*) from test_bug_hash2 a, test_bug_hash2 b where
> a.parent_id=b.id <http://b.id> and b.attrib='BEL';
>  count
> -------
>   1000
>
> I won't copy paste the pg_stats lines but most_common_vals and
> most_common_freq have values for all three columns.
>
> I'm not a programmer but I've looked into the code of the planner a
> little bit and it seems you try to estimate the selectivity of a join in
> itself without any regards to the filters that can be applied on any
> side of the join ( if I understood correctly).
> I think that it's ultimately where the problem lies, if the filter is
> not too important the selectivity stays more or less the same but with
> filters like the one in this query the selectivity of the join can
> change a lot so in the end you get estimations which can be totally wrong.
>
>

Right. The problem is that the two columns are correlated, thanks to how
you set the attrib value only for id=2, but the join estimation code is
oblivious to that.

Perhaps the multi-column/extended stats might allow us to improve this,
but the code has not been written yet.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message PG Bug reporting form 2020-12-18 07:20:36 BUG #16778: Problem: nothing provides libarmadillo.so.9()(64bit) needed by gdal30-libs-3.0.4-2.sles12.x86_64
Previous Message Bertrand Guillaumin 2020-12-17 17:36:45 Re: BUG #16759: Estimation of the planner is wrong for hash join