Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs

From: Alena Rybakina <lena(dot)ribackina(at)yandex(dot)ru>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Subject: Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs
Date: 2023-07-06 13:51:48
Message-ID: 148ff8f1-067b-1409-c754-af6117de9b7d@yandex.ru
Lists: pgsql-hackers

Hi, all!

On 26.06.2023 12:22, Andrey Lepikhov wrote:
> On 24/6/2023 17:23, Tomas Vondra wrote:
>> I really hope what I just wrote makes at least a little bit of sense.
> Throw in one more example:
>
> SELECT i AS id INTO l FROM generate_series(1,100000) i;
> CREATE TABLE r (id int8, v text);
> INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
> ANALYZE l,r;
> EXPLAIN ANALYZE
> SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;
>
> Here you can see the same kind of underestimation:
> Hash Left Join  (... rows=500 width=14) (... rows=99999 ...)
>
> So the eqjoinsel_unmatch_left() function should be modified for the
> case where nd1<nd2.
>
>
> Unfortunately, this patch could not fix the cardinality estimate for
> this query; I'll try to look into what is missing here.

I tried to fix the cardinality estimate for the query above with the
following change:

diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index 8e18aa1dd2b..40901836146 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2604,11 +2604,16 @@ eqjoinsel_inner(Oid opfuncoid, Oid collation,
                 * if we're calculating fraction of NULLs or fraction of unmatched rows.
                 */
                // unmatchfreq = (1.0 - nullfrac1) * (1.0 - nullfrac2);
-               if (nd1 > nd2)
+               if (nd1 != nd2)
                {
-                       selec /= nd1;
-                       *unmatched_frac = (nd1 - nd2) * 1.0 / nd1;
+                       selec /= Max(nd1, nd2);
+                       *unmatched_frac = abs(nd1 - nd2) * 1.0 / Max(nd1, nd2);
                }
+               /*if (nd1 > nd2)
+               {
+                       selec /= nd1;
+                       *unmatched_frac = nd1 - nd2 * 1.0 / nd1;
+               }*/
                else
                {
                        selec /= nd2;

and it worked:

SELECT i AS id INTO l FROM generate_series(1,100000) i;
CREATE TABLE r (id int8, v text);
INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
ANALYZE l,r;
EXPLAIN ANALYZE
SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;
ERROR:  relation "l" already exists
ERROR:  relation "r" already exists
INSERT 0 2
ANALYZE
                                                  QUERY PLAN
---------------------------------------------------------------------------------------------------------------
 Hash Left Join  (cost=1.09..1944.13 rows=99998 width=14) (actual time=0.152..84.184 rows=99999 loops=1)
   Hash Cond: (l.id = r.id)
   Filter: (r.v IS NULL)
   Rows Removed by Filter: 2
   ->  Seq Scan on l  (cost=0.00..1443.00 rows=100000 width=4) (actual time=0.040..27.635 rows=100000 loops=1)
   ->  Hash  (cost=1.04..1.04 rows=4 width=10) (actual time=0.020..0.022 rows=4 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 9kB
         ->  Seq Scan on r  (cost=0.00..1.04 rows=4 width=10) (actual time=0.009..0.011 rows=4 loops=1)
 Planning Time: 0.954 ms
 Execution Time: 92.309 ms
(10 rows)

It looks too simple, and I suspect I might have missed something, but so
far I haven't found any queries where it doesn't work.

I didn't see it break anything in the examples from my previous
email [1].

1.
https://www.postgresql.org/message-id/7af1464e-2e24-cfb1-b6d4-1544757f8cfa%40yandex.ru

Unfortunately, I don't quite understand your idea in point 4; could you
explain it, please?

> The good thing is this helps even for IS NULL checks on non-join-key
> columns (where we don't switch to an antijoin), but there's a couple
> things that I dislike ...
>
> 1) It's not restricted to outer joins or anything like that (this is
> mostly just my laziness / interest in one particular query, but also
> something the outer-join-aware patch might help with).
>
> 2) We probably don't want to pass this kind of information through
> sjinfo. That was the simplest thing for an experimental patch, but I
> suspect it's not the only piece of information we may need to pass to
> the lower levels of estimation code.
>
> 3) I kinda doubt we actually want to move this responsibility (to
> consider fraction of unmatched rows) to the low-level estimation
> routines (e.g. nulltestsel and various others). AFAICS this just
> "introduces NULLs" into the relation, so maybe we could "adjust" the
> attribute statistics (in examine_variable?) by inflating null_frac and
> modifying the other frequencies in MCV/histogram.
>
> 4) But I'm not sure we actually want to do that in these low-level
> selectivity functions. The outer join essentially produces output with
> two subsets - one with matches on the outer side, one without them. But
> the side without matches has NULLs in all columns. In a way, we know
> exactly how are these columns correlated - if we do the usual estimation
> (even with the null_frac adjusted), we just throw this information away.
> And when there's a lot of rows without a match, that seems bad.

--
Regards,
Alena Rybakina
Postgres Professional

Attachment Content-Type Size
0001-Fixed-the-case-of-calculating-underestimated-cardina.patch text/x-patch 4.5 KB
