From: | Alena Rybakina <lena(dot)ribackina(at)yandex(dot)ru> |
---|---|
To: | Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Andrey Lepikhov <a(dot)lepikhov(at)postgrespro(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Subject: | Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs |
Date: | 2023-07-08 08:29:52 |
Message-ID: | 866a4673-cf0d-c087-96a6-332e4da24ef8@yandex.ru |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
> Well, one option would be to modify all selectivity functions to do
> something like the patch does for nulltestsel(). That seems a bit
> cumbersome because why should those places care about maybe running on
> the outer side of a join, or what? For code in extensions this would be
> particularly problematic, I think.
Agree. I would say that we can try it if nothing else works out.
> So what I was thinking about doing this in a way that'd make this
> automatic, without having to modify the selectivity functions.
>
> Option (3) is very simple - examine_variable would simply adjust the
> statistics by tweaking the null_frac field, when looking at variables on
> the outer side of the join. But it has issues when estimating multiple
> conditions.
>
> Imagine t1 has 1M rows, and we want to estimate
>
> SELECT * FROM t1 LEFT JOIN t2 ON (t1.id = t2.id)
> WHERE ((t2.a=1) AND (t2.b=1))
>
> but only 50% of the t1 rows has a match in t2. Assume each of the t2
> conditions matches 100% rows in the table. With the correction, this
> means 50% selectivity for each condition. And if we combine them the
> usual way, it's 0.5 * 0.5 = 0.25.
>
> But we know all the rows in the "matching" part match the condition, so
> the correct selectivity should be 0.5.
>
> In a way, this is just another case of estimation issues due to the
> assumption of independence.
> FWIW, I used "AND" in the example for simplicity, but that'd probably be
> pushed to the baserel level. There'd need to be OR to keep it at the
> join level, but the overall issue is the same, I think.
>
> Also, this entirely ignores extended statistics - I have no idea how we
> might tweak those in (3).
I understood the idea - it is very similar to what is implemented in the
current patch.
But I don't understand how to do it in the examine_variable function, to
be honest.
> But (4) was suggesting we could improve this essentially by treating the
> join as two distinct sets of rows
>
> - the inner join result
>
> - rows without match on the outer side
>
> For the inner part, we would do estimates as now (using the regular
> per-column statistics). If we knew the conditions match 100% rows, we'd
> still get 100% when the conditions are combined.
>
> For the second part of the join we know the outer side is just NULLs in
> all columns, and that'd make the estimation much simpler for most
> clauses. We'd just need to have "fake" statistics with null_frac=1.0 and
> that's it.
>
> And then we'd just combine these two selectivities. If we know the inner
> side is 50% and all rows match the conditions, and no rows in the other
> 50% match, the selectivity is 50%.
>
> inner_part * inner_sel + outer_part * outer_sel = 0.5 * 1.0 + 0.0 = 0.5
>
> Now, we still have issues with independence assumption in each of these
> parts separately. But that's OK, I think.
>
> I think (4) could be implemented by doing the current estimation for the
> inner part, and by tweaking examine_variable in the "outer" part in a
> way similar to (3). Except that it just sets null_frac=1.0 everywhere.
>
> For (4) we don't need to tweak those at all,
> because for inner part we can just apply them as is, and for outer part
> it's irrelevant because everything is NULL.
I like this idea the most) I'll try to start with this and implement the
patch.
> I hope this makes more sense. If not, let me know and I'll try to
> explain it better.
Thank you for your explanation)
I will unsubscribe soon based on the results or if I have any questions.
--
Regards,
Alena Rybakina
Postgres Professional
From | Date | Subject | |
---|---|---|---|
Next Message | Tomas Vondra | 2023-07-08 09:10:26 | Re: Problems with estimating OR conditions, IS NULL on LEFT JOINs |
Previous Message | Kim Johan Andersson | 2023-07-08 06:11:10 | Re: [PATCH] Add support function for containment operators |