From: | Alena Rybakina <lena(dot)ribackina(at)yandex(dot)ru> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Subject: | Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs |
Date: | 2023-07-06 13:51:48 |
Message-ID: | 148ff8f1-067b-1409-c754-af6117de9b7d@yandex.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi, all!
On 26.06.2023 12:22, Andrey Lepikhov wrote:
> On 24/6/2023 17:23, Tomas Vondra wrote:
>> I really hope what I just wrote makes at least a little bit of sense.
> Throw in one more example:
>
> SELECT i AS id INTO l FROM generate_series(1,100000) i;
> CREATE TABLE r (id int8, v text);
> INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
> ANALYZE l,r;
> EXPLAIN ANALYZE
> SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;
>
> Here you can see the same kind of underestimation:
> Hash Left Join (... rows=500 width=14) (... rows=99999 ...)
>
> So the eqjoinsel_unmatch_left() function should be modified for the
> case where nd1<nd2.
>
>
> Unfortunately, this patch could not fix the cardinality calculation in
> this request, I'll try to look and figure out what is missing here.
I tried to fix the cardinality score in the query above by changing:
diff --git a/src/backend/utils/adt/selfuncs.c
b/src/backend/utils/adt/selfuncs.c
index 8e18aa1dd2b..40901836146 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -2604,11 +2604,16 @@ eqjoinsel_inner(Oid opfuncoid, Oid collation,
* if we're calculating fraction of NULLs or fraction
of unmatched rows.
*/
// unmatchfreq = (1.0 - nullfrac1) * (1.0 - nullfrac2);
- if (nd1 > nd2)
+ if (nd1 != nd2)
{
- selec /= nd1;
- *unmatched_frac = (nd1 - nd2) * 1.0 / nd1;
+ selec /= Max(nd1, nd2);
+ *unmatched_frac = abs(nd1 - nd2) * 1.0 /
Max(nd1, nd2);
}
+ /*if (nd1 > nd2)
+ {
+ selec /= nd1;
+ *unmatched_frac = nd1 - nd2 * 1.0 / nd1;
+ }*/
else
{
selec /= nd2;
and it worked:
SELECT i AS id INTO l FROM generate_series(1,100000) i;
CREATE TABLE r (id int8, v text);
INSERT INTO r (id, v) VALUES (1, 't'), (-1, 'f');
ANALYZE l,r;
EXPLAIN ANALYZE
SELECT * FROM l LEFT OUTER JOIN r ON (r.id = l.id) WHERE r.v IS NULL;
ERROR: relation "l" already exists
ERROR: relation "r" already exists
INSERT 0 2
ANALYZE
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Hash Left Join (cost=1.09..1944.13 rows=99998 width=14) (actual
time=0.152..84.184 rows=99999 loops=1)
Hash Cond: (l.id = r.id)
Filter: (r.v IS NULL)
Rows Removed by Filter: 2
-> Seq Scan on l (cost=0.00..1443.00 rows=100000 width=4) (actual
time=0.040..27.635 rows=100000 loops=1)
-> Hash (cost=1.04..1.04 rows=4 width=10) (actual
time=0.020..0.022 rows=4 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Seq Scan on r (cost=0.00..1.04 rows=4 width=10) (actual
time=0.009..0.011 rows=4 loops=1)
Planning Time: 0.954 ms
Execution Time: 92.309 ms
(10 rows)
It looks too simple and I suspect that I might have missed something
somewhere, but so far I haven't found any examples of queries where it
doesn't work.
I didn't see it breaking anything in the examples from my previous
letter [1].
1.
https://www.postgresql.org/message-id/7af1464e-2e24-cfb1-b6d4-1544757f8cfa%40yandex.ru
Unfortunately, I can't understand your idea from point 4, please explain it?
The good thing is this helps even for IS NULL checks on non-join-key
columns (where we don't switch to an antijoin), but there's a couple
things that I dislike ...
1) It's not restricted to outer joins or anything like that (this is
mostly just my laziness / interest in one particular query, but also
something the outer-join-aware patch might help with).
2) We probably don't want to pass this kind of information through
sjinfo. That was the simplest thing for an experimental patch, but I
suspect it's not the only piece of information we may need to pass to
the lower levels of estimation code.
3) I kinda doubt we actually want to move this responsibility (to
consider fraction of unmatched rows) to the low-level estimation
routines (e.g. nulltestsel and various others). AFAICS this just
"introduces NULLs" into the relation, so maybe we could "adjust" the
attribute statistics (in examine_variable?) by inflating null_frac and
modifying the other frequencies in MCV/histogram.
4) But I'm not sure we actually want to do that in these low-level
selectivity functions. The outer join essentially produces output with
two subsets - one with matches on the outer side, one without them. But
the side without matches has NULLs in all columns. In a way, we know
exactly how are these columns correlated - if we do the usual estimation
(even with the null_frac adjusted), we just throw this information away.
And when there's a lot of rows without a match, that seems bad.
--
Regards,
Alena Rybakina
Postgres Professional
Attachment | Content-Type | Size |
---|---|---|
0001-Fixed-the-case-of-calculating-underestimated-cardina.patch | text/x-patch | 4.5 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2023-07-06 14:02:44 | Re: UUID v7 |
Previous Message | Daniel Gustafsson | 2023-07-06 13:50:57 | Re: pg_basebackup check vs Windows file path limits |