Re: BUG #17964: Missed query planner optimization

From: Mathias Kunter <mathiaskunter(at)gmail(dot)com>
To: David Rowley <dgrowleyml(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17964: Missed query planner optimization
Date: 2023-06-16 12:56:07
Message-ID: a29936a0-e446-d4d1-6bd1-42d7212c26a0@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

> We'd need a way to determine which is cheaper based on estimated costs

Yes, that decision must of course be based on estimated costs, and also
the cost estimation must be reasonably accurate. In all examples which
I've seen so far, the cost estimations were adequate.

> that might be, in my opinion, fairly tricky to do based on the order of
> how things currently are done in the query planner.

I don't know the details about the current planner implementation, but
the planner obviously can already correctly estimate costs for both the
IN and ANY(ARRAY()) variants.

> I was just trying to show that it's quite a tricky problem to
> fix properly.

Yes, the tricky part obviously is to decide whether or not an IN clause
should be converted to a semi-join, because that decision depends on the
other IN clauses of the query as well. So, it's unfortunately not
possible to decide independently for each IN clause.

As you've already pointed out, doing an exhaustive search of all
possible combinations is obviously problematic, since we'd have to check
2^N possible query plans (with N being the number of IN clauses of the
query). So, this is certainly NOT the way to go, unless maybe N<4 or so.

> Having the planner consider the costs of converting the IN to a semi-
> join and converting or not based on cost seems like a worthy goal.

It obviously is. As shown, the performance improvement can be gigantic
for certain queries.

> If you think it's quite an easy project, then we'll welcome
> contributions to improve this.

I'm not familiar with the current implementation, but my understanding
is that this should be quite easy to implement for someone who is
currently working on planner-related issues.

After all, a reasonable heuristic solution can be as simple as this:

1) Create the query plan for the original query, using the current
planner implementation. (That's the status quo.)

2) Create a modified version of the original input query where all of
the IN clauses are replaced with ANY(ARRAY()). Create the query plan for
that modified query version, using the current planner implementation.

3) Compare the two previously created query plans and execute the one
with the lower estimated total cost. If you don't trust the planner's
cost estimations for some reason, then only execute the modified version
if its estimated cost is SIGNIFICANTLY lower.

A simple solution like this would already be good enough to avoid many
of the horrifying worst case scenarios, where a query takes minutes when
it should actually only take milliseconds.

It would of course be possible (and desirable) to come up with a more
sophisticated heuristic than the one given above. For example, create a
third version of the original query where only those IN clauses are
replaced with ANY(ARRAY()) where the estimated number of rows returned
by the subselect is below a certain threshold, and estimate costs for
that modified query as well.

> The pgsql-hackers as a good emailing list to bring up the topic

Thank you for the hint. I'll continue the discussion over there.

Mathias

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Jelte Fennema 2023-06-16 13:00:38 Re: Internal error with types changes and prepared statements
Previous Message Junwang Zhao 2023-06-16 08:13:39 Re: No -d option