Improvement of var_eq_non_const()

From: Teodor Sigaev <teodor(at)sigaev(dot)ru>
To: Pgsql Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Improvement of var_eq_non_const()
Date: 2025-02-20 17:05:48
Message-ID: 9789f79b-34f0-49ee-9852-783392a3615c@sigaev.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi!

Comment in var_eq_non_const() says:

/*
* Search is for a value that we do not know a priori, but we will
* assume it is not NULL. Estimate the selectivity as non-null
* fraction divided by number of distinct values, so that we get a
* result averaged over all possible values whether common or
* uncommon. (Essentially, we are assuming that the not-yet-known
* comparison value is equally likely to be any of the possible
* values, regardless of their frequency in the table. Is that a good
* idea?)
*/

Seems, it is workable, but not so good idea, because it could be a reason of
significant underestimation in case of non-uniform data distribution. Often,
this leads to a choice of nested loop instead of other join algorithms and, with
underestimation, planner chooses a wrong plan. In attachment there is a dump and
query from real application, but clipped and obfuscated, sorry.

explain analyze
SELECT
COUNT(*)
FROM
t1
LEFT JOIN
t2
ON t2.b = t1.a
AND t2.c = 0 //line of interest
;

With commented 'line of interest' everything works fine without any problems,
but with that line we will get another plan, much worst. Interesting, that t2.c
column contains only one value (zero). Planner chooses nested loop instead of
hash join and makes a wrong estimation:
Index Only Scan using i1_t2 on t2 (.. rows=6 ..) (.. rows=3555 loops=2000)
Index Cond: ((c = 0) AND (b = t1.a))

It supposed six rows but executor got 3555. That caused because
var_eq_non_const() supposes uniform distribution of t2.b column which is wrong.
Actually, this example works fine anyway, but real query contains a lot other
joins connected to t2 and this planner's mistake becomes a huge performance
issue, up to three orders of magnitude.

I'd like to suggest to improve var_eq_non_const() by using knowledge of MCV and
estimate the selectivity as quadratic mean of non-null fraction divided by
number of distinct values (as it was before) and set of MCV selectivities. Use
quadratic mean because it includes the squared deviation (error) as well and
here it would be nice to compute upper limit of estimation to prevent wrong
choose of nested loop, for example.

If nested loop is forced and patch is applied it will produce much better
estimation:
Index Only Scan using i1_t2 on t2 (.. rows=3380 ..) (.. rows=3555 loops=2000)

--
Teodor Sigaev E-mail: teodor(at)sigaev(dot)ru
WWW: http://www.sigaev.ru/

Attachment Content-Type Size
dump.sql.bz2 application/octet-stream 23.7 KB
v1-0001-Estimate-the-selectivity-as-quadratic-mean-of-non.patch text/x-patch 3.0 KB
e.sql application/sql 289 bytes

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2025-02-20 17:15:36 Re: Proposal: pg_is_volatile function
Previous Message Tomas Vondra 2025-02-20 16:56:02 Re: Parallel CREATE INDEX for GIN indexes