Re: allowing extensions to control planner behavior

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: allowing extensions to control planner behavior
Date: 2024-10-16 16:53:17
Message-ID: CA+TgmoakMJVbRur2eksQr76-nTkT-H8u9uWqRsZvmvfH=3hYMw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Oct 14, 2024 at 6:02 AM Jakub Wartak
<jakub(dot)wartak(at)enterprisedb(dot)com> wrote:
>> I wonder if we could think about reversing the order of operations
>> here and making it so that we do the distinct-ification during parse
>> analysis or maybe early in planning, so that the name you see EXPLAIN
>> print out is the real name of that thing and not just a value invented
>> for display purposes.
>
> This if that's possible?, or simply some counter and numbering the plan operation? or Andrei's response/idea of using hashing??

I spent some time looking at this. I think everything that we might
ever want to do here is possible, but what I unfortunately couldn't
find is a place to do it where it would be more-or-less free.

In parse analysis, we do detect certain kinds of namespace collisions.
For example, "SELECT * FROM x, x" or "SELECT * FROM x, y x" will fail
with ERROR: table name "x" specified more than once. Similarly,
"SELECT * FROM generate_series(1,2), x generate_series" will fail with
ERROR: table name "generate_series" specified more than once, even
though generate_series is not, strictly speaking, a table name. From
this you might infer that each alias can be used only once, but it
turns out that's not entirely correct. If you say "SELECT * FROM x,
q.x" that is totally fine even though both relations end up with the
alias "x". There's a specific exception in the code to allow this
case, when the aliases are the same and but they refer to different
relation OIDs, and this is justified by reference to the SQL standard.
Furthermore, "SELECT * FROM (x JOIN y ON true) x, y" is totally fine
even though there are two x's and two y's. The fact that the
parenthesized part has its own alias makes everything inside the
parentheses a separate scope from the names outside the parentheses,
so there's no name conflict. If you delete the "x" just after the
closing parenthesis then it's all a single scope and you get a
complaint about "y" being used twice. In this example, we're allowed
to have collisions even though there are no subqueries, but it is also
true that each subquery level gets its own scope e.g. "SELECT * FROM
(SELECT * FROM x) x" is permitted.

What I think all this means is that piggy-backing on the existing
logic here doesn't look especially promising. If we're trying to
identify a specific usage of a specific relation inside of a query, we
want global uniqueness across the whole query, but this isn't even
local uniqueness within a certain query level -- it's uniqueness
within a part of a query level with a special carve-out for same-named
tables in different schemas. Just to drive the point home, the
duplicate checks are done with two nested for loops, a good
old-fashioned O(n^2) algorithm, instead of something like entering
everything into a hash table and complaining if the entry already
exists. We're basically relying on the fact that there won't be very
many items at a single level of a FROM-list for this to have
reasonable performance; and scaling it even as far as the whole query
sounds like it might be dubious. Maybe it's possible to rewrite this
somehow to use a hash table and distinguish the cases where it
shouldn't complain, but it doesn't look like a ton of fun.

Once we get past parse analysis, there's really nothing that cares
about duplicate names, as far as I can see. The planner is perfectly
happy to work on multiple instances of the same table or of different
tables with the same alias, and it really has no reason to care about
what anything would look like if it were deparsed and printed out. I
suppose this is why the code works the way it does: by postponing
renaming until somebody actually does EXPLAIN or some other deparsing
operation, we avoid doing it when it isn't "really needed". Aside from
this nudge-the-planner use case there's no real reason to do anything
else.

In terms of solving the problem, nothing prevents us from installing a
hash table in PlannerGlobal and using that to label every
RangeTblEntry (except unnamed joins, probably) with a unique alias. My
tentative idea is to have the hash table key be a string. We'd check
whether the table alias is of the form
string+'_'+positive_integer_without_a_leading_zero. If so, we'd look
up the string in the hash table, else the entire alias. The hash table
entry would tell us which integers were already used for that string,
so then we could arrange to use one that isn't used yet, essentially
re-aliasing tables wherever collisions occur. However, in the "normal"
case where the query plan is not being printed and no
query-plan-frobbing extensions are in use, this effort is all wasted.
Maybe there's a way to do this kind of work only on demand, but it
seems to be somewhat complicated by the way we handle flattening the
range table: initially, every subquery level gets its own range table,
and at the end of planning, those range tables are all collapsed into
one. This means that depending on WHEN we get asked for information
about the re-aliased table names, the information is in a different
place. Uggh.

One could also argue that this sort of functionality doesn't belong in
core at all - that it's the problem of an extension that wants to frob
the planner behavior to figure out how to identify the rels. But I
think that is short-sighted. I think what people will want to do is
have the output of EXPLAIN tell them how they can identify a rel for
frobbing purposes. If not that, then what?

> Even as it stands today, the v4-0002 would be better to have than nothing (well other than pg_hint_plan), as the it looks to me that the most frequent workaround for optimizer issues is to just throw 'enable_nestloop = no' into the mix quite often (so having the ability to just throw fixproblem.so into session_preload_libraries with just strstr()/regex() - to match on specific query - and disable it just there seems to be completely achievable and has much better granularity when targeting whole sessions with SET issued there).

Thanks for the +1.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Geoghegan 2024-10-16 17:14:29 Re: Adding skip scan (including MDAM style range skip scan) to nbtree
Previous Message Tom Lane 2024-10-16 16:41:42 Re: Misleading error "permission denied for table"