From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Richard Guo <guofenglinux(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "Gregory Stark (as CFM)" <stark(dot)cfm(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>, Pg Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Emre Hasegeli <emre(at)hasegeli(dot)com>
Subject: Re: Using each rel as both outer and inner for JOIN_ANTI
Date: 2023-04-06 22:44:20
Message-ID: CA+hUKGJsT_Uag8pr_VDRPiJ-XaMH2DpnsFeR=JXWjP90CP3nCQ@mail.gmail.com
Lists: pgsql-hackers
On Thu, Apr 6, 2023 at 6:40 PM Richard Guo <guofenglinux(at)gmail(dot)com> wrote:
> Seems it wins if the parallel scan becomes part of a hash join in final
> plan. I wonder if we have a way to know that in this early stage.
I haven't tried, but I'm not sure off the top of my head how to make a
decision that early unless it's super coarse-grained, like is there an
anti join in here somewhere...
Generally, this seems to be a consequence of our parallel query
planner design where parallel paths are generated separately and
compete on cost, as foretold by Hong and Stonebraker[1]. It's going
to be hard to prune those early without missing out on some good
plans, if you don't have a crystal ball.
I wondered for a while what horrible technical problems would come up
if we could defer creation of paths, so partial_pathlist is empty but
a new RelOptInfo flag says "you can call
finish_the_partial_pathlist_please(root, rel) if you really want
one". We could try to be sparing about calling it, so that we don't
finish up creating them all. That would at least move the
should-I-bother-to-make-this-path? logic close to the place with the
knowledge that it'd be useful, in this case the inner side of a
parallel hash join. One problem is that you have to get a
parallel_workers number from somewhere to make a partial path. The
hash join path code knows what its own parallel_workers number will be
(it inherits it from the outer path, though I can't immediately think
of any good reason why it shouldn't be Max(inner, outer)), but if we
were to feed that into a created-on-demand inner path that is then
cached, we'd have a horrible ordering dependency (some other join in
the query gets planned first with a different parallel_workers number
and it gets cached differently). Yuck. As you noted, 0 isn't a great
number either, but it leads to another thought...
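To make that concrete, here's roughly the shape I'm imagining (all of
this is invented, nothing like it exists today):

    /*
     * Hypothetical new RelOptInfo flag: "partial paths were deliberately
     * not generated; ask for them explicitly if you decide you want one".
     */
    bool        partial_pathlist_deferred;

    /*
     * Hypothetical on-demand builder.  A caller like the parallel hash
     * join path code would invoke this only for rels it wants on the
     * inner side of a parallel hash join.
     */
    void
    finish_the_partial_pathlist_please(PlannerInfo *root, RelOptInfo *rel)
    {
        if (!rel->partial_pathlist_deferred)
            return;             /* nothing was deferred for this rel */
        rel->partial_pathlist_deferred = false;

        /*
         * Re-run scan-level partial path generation for this rel; this
         * is where the parallel_workers question above bites.
         */
    }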
I wondered if we could somehow use the complete (non-partial) path
directly in some cases here, if certain conditions are met. Aside
from any other technical problems, you might ask "but what about the
estimates/costing we already did for that path"; well the numbers are
usually wrong anyway! We have complete paths, and partial paths for
arbitrary parallel_workers numbers that bubbled up from our scan
size-based heuristics. Every time we combine more than one partial
path, we have to handle non-matching "parallel degree" (I'm using that
term to mean the result of get_parallel_divisor(path), a lightly
grilled version of parallel_workers that makes dubious assumptions,
but I digress). Concretely, parallel append and parallel hash join
already rescale some of their input variables to match their own
nominal parallel degree (ugh, I see a few things we should fix in that
code, but this email is long enough already). I wonder if we might
need some infrastructure to formalise that sort of thing. For
example, look at the row estimates in the EXPLAIN of parallel append
over tables with parallel_workers explicitly set to different numbers
using ALTER TABLE. They are wrong (they're based on different
parallel degrees; turn off parallel_leader_participation to make the
arithmetic easier to follow), while the append node itself has
rescaled them and has a good row estimate for its own nominal parallel
degree, which in turn might be wrong depending on what is above.
Perhaps EXPLAIN and everything else should use some common
infrastructure to deal with this.
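The common infrastructure could start out as small as a shared helper
that EXPLAIN and the path construction code both call, instead of each
open-coding the division (hypothetical sketch; get_parallel_divisor()
is currently static in costsize.c, so it'd need exposing):

    /*
     * Hypothetical: rescale a row estimate from the parallel degree it
     * was made at to some other parallel degree.  Parallel append and
     * parallel hash join each open-code a variant of this today.
     */
    static double
    rescale_path_rows(double rows, double source_divisor, double target_divisor)
    {
        /* rows * source_divisor recovers the total; re-divide for target */
        return rows * source_divisor / target_divisor;
    }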
In other words, we avoid the need for a try-every-parallel-degree
explosion by rescaling from some arbitrary input degree to some
arbitrary output degree, but we can't go all the way and do the
two-phase Hong thing and rescale from non-partial paths in general
(for various technical reasons that apply to more complex nodes, but
not to basic scans). From where we are, I'm not sure how big a step
it is to (1) formalise the path rescaling system and (2) be
able to rescale some qualifying simple complete paths too, if needed
for places like this.
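In that scheme, a qualifying complete path would just be the degenerate
case with a source divisor of 1.0, still using the invented helper from
above:

    /* a complete path counts every row once, i.e. parallel degree 1 */
    inner_rows = rescale_path_rows(complete_path->rows, 1.0,
                                   get_parallel_divisor(outer_path));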
Of course you could quibble with the concept of linear scaling of
various numbers by parallel degrees; various things aren't linear or
even continuous (probably why Hong's system included hash join
thresholds). Even the concept of get_parallel_divisor(path) as
applied to row estimates is suspect, because it assumes that the
planned workers will show up; if a smaller number shows up (at all,
due to max_parallel_workers, or just to this node because a higher
level parallel append sent workers to different subplans) then a node
might receive many more input tuples than expected and blow through
work_mem, even if all estimates were 100% perfect. I have no idea
what to do about that. At least *that* problem doesn't apply to
parallel hash, which shares the memory for the number of planned
workers, even if fewer show up, but that idea isn't without critics
either.
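To spell out the dubious assumptions, the divisor is computed like this
(paraphrasing get_parallel_divisor() in costsize.c):

    static double
    parallel_divisor(Path *path)
    {
        double      divisor = path->parallel_workers;

        if (parallel_leader_participation)
        {
            /* the leader is assumed to help less as workers are added */
            double      leader_contribution = 1.0 - (0.3 * path->parallel_workers);

            if (leader_contribution > 0)
                divisor += leader_contribution;
        }

        return divisor;
    }

So a node planned for 2 workers divides its row estimate by 2.4; if
only one worker launches, the leader and that worker each see about
1.2x the per-process estimate, and if none launch the leader sees 2.4x.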
I dunno. Sorry for the wall of text/ideas. I see unfinished business
in every direction.