Re: allowing extensions to control planner behavior

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: allowing extensions to control planner behavior
Date: 2024-09-18 15:48:49
Message-ID: CA+TgmoYadvnRGM7VEqgKbuZ+c+VN3jSDx+_38e66a9yk2fiovA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Aug 27, 2024 at 4:07 PM Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> I think that the beginning of add_paths_to_joinrel() looks like a
> useful spot to get control. You could, for example, have a hook there
> which returns a bitmask indicating which of merge-join, nested-loop,
> and hash join will be allowable for this call; that hook would then
> allow for control over the join method and the join order, and the
> join order control is strong enough that you can implement either of
> the two interpretations above. This idea theorizes that 0001 was wrong
> to make the path mask a per-RelOptInfo value, because there could be
> many calls to add_paths_to_joinrel() for a single RelOptInfo and, in
> this idea, every one of those can enforce a different mask.

Here is an implementation of this idea. I think this is significantly
more elegant than the previous patch. Functionally, it does a better
job allowing for control over join planning than the previous patch,
because you can control both the join method and the join order. It
does not attempt to provide control over scan or appendrel methods; I
could build similar machinery for those cases, but let's talk about
this case, first. As non-for-commit proofs-of-concept, I've included
two sample contrib modules here, one called alphabet_join and one
called hint_via_alias. alphabet_join forces the join to be done in
alphabetical order by table alias. Consider this test query:

SELECT COUNT(*) FROM pgbench_accounts THE
INNER JOIN pgbench_accounts QUICK on THE.aid = QUICK.aid
INNER JOIN pgbench_accounts BROWN on THE.aid = BROWN.aid
INNER JOIN pgbench_accounts FOX on THE.aid = FOX.aid
INNER JOIN pgbench_accounts JUMPED on THE.aid = JUMPED.aid
INNER JOIN pgbench_accounts OVER on THE.aid = OVER.aid
INNER JOIN pgbench_accounts LAZY on THE.aid = LAZY.aid
INNER JOIN pgbench_accounts DOG on THE.aid = DOG.aid;

When you just execute this normally, the join order matches the order
you enter it: THE QUICK BROWN FOX JUMPED OVER LAZY DOG. But if you
load alphabet_join, then the join order becomes BROWN DOG FOX JUMPED
LAZY OVER QUICK THE. It is unlikely that anyone wants their join order
to be determined by strict alphabetical order, but again, this is just
intended to show that the hook works.

hint_via_alias whose table alias starts with mj_, hj_, or nl_ using a
merge-join, hash-join, or nested loop, respectively. Here again, I
don't think that passing hints through the table alias names is
probably the best thing from a UI perspective, but unlike the previous
one which is clearly a toy, I can imagine someone actually trying to
use this one on a real server. If we want anything in contrib at all
it should probably be something much better than this, but again at
this stage I'm just trying to showcase the hook.

> Potentially, such a hook could return additional information, either
> by using more bits of the bitmask or by returning other information
> via some other data type. For instance, I still believe that
> distinguishing between parameterized-nestloops and
> unparameterized-nestloops would be real darn useful, so we could have
> separate bits for each; or you could have a bit to control whether
> foreign-join paths get disabled (or are considered at all), or you
> could have separate bits for merge joins that involve 0, 1, or 2
> sorts. Whether we need or want any or all of that is certainly
> debatable, but the point is that if you did want some of that, or
> something else, it doesn't look difficult to feed that information
> through to the places where you would need it to be available.

I spent a lot of time thinking about what should and should not be in
scope for this hook and decided against both of the ideas above.
They're not necessarily bad ideas but they feel like examples of
arbitrary policy that you could want to implement, and I don't think
it's viable to have every arbitrary policy that someone happens to
favor in the core code. If we want extensions to be able to achieve
these kinds of results, I think we're going to need a hook at either
initial_cost_XXX time that would be free to make arbitrary decisions
about cost and disabled nodes for each possible path we might
generate, or a hook at final_cost_XXX time that could make paths more
disabled (but not less) or more costly (but not less costly unless it
also makes them more disabled). For now, I have not done that, because
I think the hook that I've added to add_paths_to_joinrel is fairly
powerful and significantly cheaper than a hook that has to fire for
every possible generated path. Also, even if we do add that, I bet
it's useful to let this hook pass some state through to that hook, as
a way of avoiding recomputation.

However, although I didn't end up including either of the policies
mentioned above in this patch, I still did end up subdividing the
"merge join" strategy according to whether or not a Materialize node
is used, and the "nested loop" strategy according to whether we use
Materialize, Memoize, or neither. At least according to my reading of
the code, the planner really does consider these to be separate
sub-strategies: it thinks about whether to use a nested loop without a
materialize node, and it thinks about whether to do a nested loop with
a materialize node, and there's separate code for those things. So
this doesn't feel like an arbitrary distinction to me. In contrast,
the parameterized-vs-unparameterized nested loop thing is just a
question of whether the outer path that we happen to choose happens to
satisfy some part of the parameterization of the inner path we happen
to choose; the code neither knows nor cares whether that will occur.
There is also a pragmatic reason to make sure that the hook allows for
control over use of these sub-strategies: pg_hint_plan has Memoize and
NoMemoize hints, and if whatever hook we add here can't replace what
pg_hint_plan is already doing, then it's clearly not up to the mark.

I also spent some time thinking about what behavior this hook does and
does not allow you to control. As noted, it allows you to control the
join method and the join order, including which table ends up on which
side of the join. But, is that good enough to reproduce a very
specific plan, say one that you saw before and liked? Let's imagine
that you've arranged to disable every path in outerrel and innerrel
other than the ones that you want to be chosen, either using some
hackery or some future patch not included here. Now, you want to use
this patch to make sure that those are joined in the way that you want
them to be joined. Can you do that? I think the answer is "mostly".
You'll be able to get the join method you want used, and you'll be
able to get Memoize and/or Materialize nodes if you want them or avoid
them if you don't. Also, join_path_setup_hook will see
JOIN_UNIQUE_INNER or JOIN_UNIQUE_OUTER so if we're thinking of
implementing a semijoin via uniquify+join, the hook will be able to
encourage or discourage that approach if it wants. However, it *won't*
be able to force the uniquification to happen using hashing rather
than sorting or vice versa, or at least not without doing something
pretty kludgy. Also, it won't be able to force a merge join produced
by sort_inner_and_outer() to use the sort keys that it prefers. Merge
joins produced by match_unsorted_outer() are entirely a function of
the input paths, but those produced by sort_inner_and_outer() are not.
Aside from these two cases, I found no other gaps.

AFAICS, the only way to close the gap around unique-ification strategy
would be with some piece of bespoke infrastructure that just does
exactly that. The inability to control the sort order selected by
sort_inner_and_outer() could, I believe, be closed by a hook at
initial or final cost time. As noted above, such a hook is also useful
when you're not trying to arrive at a specific plan, but rather have
some policy around what kind of plan you want to end up with and wish
to penalize plans that don't comply with your policy. So maybe this is
an argument for adding that hook. That said, even without that, this
might be close enough for government work. If the planner chooses the
correct input paths and if it also chooses a merge join, how likely is
it that it will now choose the wrong pathkeys to perform that merge
join? I bet it's quite unlikely, because I think the effect of the
logic we have will probably be to just do the smallest possible amount
of sorting, and that's also probably the answer the user wants. So it
might be OK in practice to just not worry about this. On the other
hand, a leading cause of disasters is people assuming that certain
things would not go wrong and then having those exact things go wrong,
so it's probably unwise to be confident that the attached patch is all
we'll ever need.

Still, I think it's a pretty useful starting point. It is mostly
enough to give you control over join planning, and if combined with
similar work for scan planning, I think it would be enough for
pg_hint_plan. If we also got control over appendrel and agg planning,
then you could do a bunch more cool things.

Comments?

--
Robert Haas
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v3-0003-New-contrib-module-hint_via_alias.patch application/octet-stream 6.0 KB
v3-0002-New-contrib-module-alphabet_join.patch application/octet-stream 4.7 KB
v3-0001-Allow-extensions-to-control-join-strategy.patch application/octet-stream 49.0 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2024-09-18 15:58:34 Re: Partitioned tables and [un]loggedness
Previous Message Nathan Bossart 2024-09-18 15:17:47 Re: Partitioned tables and [un]loggedness