Re: Adjust ndistinct for eqjoinsel

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Zhenghua Lyu <zlyu(at)vmware(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Adjust ndistinct for eqjoinsel
Date: 2022-07-15 15:56:41
Message-ID: 3394045.1657900601@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Zhenghua Lyu <zlyu(at)vmware(dot)com> writes:
> I run TPC-DS benchmark for Postgres and find the join size estimation has several problems.
> For example, Ndistinct is key to join selectivity's estimation, this value does not take restrictions
> of the rel, I hit some cases in the function eqjoinsel, nd is much larger than vardata.rel->rows.

> Accurate estimation need good math model that considering dependency of join var and vars in restriction.
> But at least, indistinct should not be greater than the number of rows.

> See the attached patch to adjust nd in eqjoinsel.

We're very unlikely to accept this with no test case and no explanation
of why it's not an overcorrection. get_variable_numdistinct already
clamps its result to rel->tuples, and I think that by using rel->rows
instead you are probably double-counting the selectivity of the rel's
restriction clauses.

See the sad history of commit 7f3eba30c, which did something
pretty close to this and eventually got almost entirely reverted
(97930cf57, 0d3b231ee). I'd be the first to agree that better
estimates here would be great, but it's not as simple as it looks.

regards, tom lane

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Melanie Plageman 2022-07-15 15:59:41 Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)
Previous Message Stephen Frost 2022-07-15 15:17:24 Re: Add function to return backup_label and tablespace_map