From: | David Rowley <dgrowleyml(at)gmail(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: disfavoring unparameterized nested loops |
Date: | 2021-06-15 23:59:55 |
Message-ID: | CAApHDvo2sMPF9m=i+YPPUssfTV1GB=Z8nMVa+9Uq4RZJ8sULeQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Wed, 16 Jun 2021 at 05:09, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> How to do that is not very clear. One very simple thing we could do
> would be to introduce enable_nestloop=parameterized or
> enable_parameterized_nestloop=false. That is a pretty blunt instrument
> but the authors of the paper seem to have done that with positive
> results, so maybe it's actually not a dumb idea.
It's not great that people are having to use such blunt instruments to
get the planner to behave. It might not be a terrible idea to provide
them with something a bit sharper as you suggest. The GUC idea is
certainly something that could be done without too much effort.
There was some talk of doing that in [1].
> Another approach
> would be to introduce a large fuzz factor for such nested loops e.g.
> keep them only if the cost estimate is better than the comparable hash
> join plan by at least a factor of N (or if no such plan is being
> generated for some reason).
In my experience, the most common reason that the planner chooses
non-parameterized nested loops wrongly is when there's row
underestimation that says there's just going to be 1 row returned by
some set of joins. The problem often comes when some subsequent join
is planned and the planner sees the given join rel only produces one
row. The cheapest join method we have to join 1 row is Nested Loop.
So the planner just sticks the 1-row join rel on the outer side
thinking the executor will only need to scan the inner side of the
join once. When the outer row count blows up, then we end up scanning
that inner side many more times. The problem is compounded when you
nest it a few joins deep
Most of the time when I see that happen it's down to either the
selectivity of some correlated base-quals being multiplied down to a
number low enough that we clamp the estimate to be 1 row. The other
case is similar, but with join quals.
It seems to me that each time we multiply 2 selectivities together
that the risk of the resulting selectivity being out increases. The
risk is likely lower when we have some extended statistics which
allows us to do fewer selectivity multiplications.
For that 1-row case, doing an unparameterized nested loop is only the
cheapest join method by a tiny amount. It really wouldn't be much
more expensive to just put that single row into a hash table. If that
1 estimated row turns out to be 10 actual rows then it's likely not
too big a deal for the hash join code to accommodate the 9 additional
rows.
This makes me think that it's insanely risky for the planner to be
picking Nested Loop joins in this case. And that puts me down the path
of thinking that this problem should be solved by having the planner
take into account risk as well as costs when generating plans.
I don't really have any concrete ideas on that, but a basic idea that
I have been considering is that a Path has a risk_factor field that is
decided somewhere like clauselist_selectivity(). Maybe the risk can go
up by 1 each time we multiply an individual selectivity. (As of
master, estimate_num_groups() allows the estimation to pass back some
further information to the caller. I added that for Result Cache so I
could allow the caller to get visibility about when the estimate fell
back on DEFAULT_NUM_DISTINCT. clauselist_selectivity() maybe could get
similar treatment to allow the risk_factor or number of nstats_used to
be passed back). We'd then add a GUC, something like
planner_risk_adversion which could be set to anything from 0.0 to some
positive number. During add_path() we could do the cost comparison
like: path1.cost * path1.risk_factor * (1.0 + planner_risk_adversion)
< path2.cost * path2.risk_factor * (1.0 + planner_risk_adversion).
That way, if you set planner_risk_adversion to 0.0, then the planner
does as it does today, i.e takes big risks.
The problem I have with this idea is that I really don't know how to
properly calculate what the risk_factor should be set to. It seems
easy at first to set it to something that has the planner avoid these
silly 1-row estimate nested loop mistakes, but I think what we'd set
the risk_factor to would become much more important when more and more
Path types start using it. So if we did this and just guessed the
risk_factor, that might be fine when only 1 of the paths being
compared had a non-zero risk_factor, but as soon as both paths have
one set, unless they're set to something sensible, then we just end up
comparing garbage costs to garbage costs.
Another common mistake the planner makes is around WHERE a = <value>
ORDER BY b LIMIT n; where there are separate indexes on (a) and (b).
Scanning the (b) index is pretty risky. All the "a" values you need
might be right at the end of the index. It seems safer to scan the (a)
index as we'd likely have statistics that tell us how many rows exist
with <value>. We don't have any stats that tell us where in the (b)
index are all the rows with a = <value>.
I don't really think we should solve this by having nodeNestloop.c
fall back on hashing when the going gets tough. Overloading our nodes
that way is not a sustainable thing to do. I'd rather see the
executor throw the plan back at the planner along with some hints
about what was wrong with it. We could do that providing we've not
sent anything back to the client yet.
David
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Geoghegan | 2021-06-16 00:11:39 | Re: disfavoring unparameterized nested loops |
Previous Message | Jacob Champion | 2021-06-15 23:50:14 | Re: Support for NSS as a libpq TLS backend |