Re: Performance issues related to left join and order by

From: David Rowley <dgrowleyml(at)gmail(dot)com>
To: "Liu, Xinyu" <liuxy(at)gatech(dot)edu>
Cc: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Performance issues related to left join and order by
Date: 2021-03-02 10:26:26
Message-ID: CAApHDvqxFMOu-2M0mGsYLXt_b5Ptqti8ceMkeOePhwrULF4vXA@mail.gmail.com
Lists: pgsql-performance

On Tue, 2 Mar 2021 at 21:53, Liu, Xinyu <liuxy(at)gatech(dot)edu> wrote:
> *Expected Behavior
>
> Since these two queries are semantically equivalent, we were hoping that PostgreSQL would evaluate them in roughly the same amount of time.
> It looks to me that there is a missing optimization rule related to pushing the sort operator (i.e., order and limit) through the left join.
> Given the significant query execution time difference, I was wondering if it is worth adding such a rule to make the system evaluate the first query more efficiently.
> It would also be helpful if you could comment on if there is a standard practice to evaluate the tradeoff associated with adding such a rule in Postgresql.

We currently don't attempt to push down LIMIT clauses into subqueries.
Before we could do that, we'd need to get much better at figuring out
how joins duplicate rows, so that we could be sure we're not limiting
the subquery to fewer records than the outer query needs to reach its
limit.
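
To make that concrete, here's a sketch against a made-up schema (t1
and t2 below are hypothetical, not taken from your report):

  -- LIMIT applied to the join result, as written today
  SELECT t1.id, t2.payload
    FROM t1
    LEFT JOIN t2 ON t2.t1_id = t1.id
   ORDER BY t1.id
   LIMIT 10;

  -- Hand-pushed-down version.  For this particular LEFT JOIN it
  -- happens to be safe, because every t1 row yields at least one
  -- output row.  Swap in an INNER JOIN and the 10 rows the subquery
  -- keeps might all fail to find a match, leaving the outer query
  -- short of rows.
  SELECT sub.id, t2.payload
    FROM (SELECT id FROM t1 ORDER BY id LIMIT 10) AS sub
    LEFT JOIN t2 ON t2.t1_id = sub.id
   ORDER BY sub.id
   LIMIT 10;

The planner would have to prove that sort of safety condition before
it could apply the transformation automatically.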

If you want some advice, you're likely to get more people on your
side, and possibly support for making improvements to the query
planner, if you provide examples that look remotely like real-world
queries. In the other emails that I've read from you on this list
[1], it seems your example queries are all completely bogus. I
suspect that the queries are generated by some fuzz testing tool. I
very much imagine that you don't really need help with these at all.
With respect, it seems to me that there's about zero chance that you
genuinely need the results of this query more quickly and have come
here for help with that.

Because PostgreSQL does not proactively cache query plans, ad-hoc
queries are always parsed, planned and then executed. This means that
it's often not practical to spend excessive amounts of time planning a
query that gets executed just once. Adding new optimisations to the
query planner means they either have to be very cheap to detect, or
they must pay off in many cases.
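
As a rough illustration (some_table here is just a placeholder),
EXPLAIN (ANALYZE) reports planning and execution time separately,
which gives a feel for how much of a one-shot query's cost is spent
in the planner:

  EXPLAIN (ANALYZE) SELECT count(*) FROM some_table;
  -- the output ends with separate "Planning Time" and
  -- "Execution Time" lines

and a plan only gets reused if you opt in, for example with a
prepared statement:

  PREPARE q (int) AS SELECT * FROM some_table WHERE id = $1;
  EXECUTE q(42);  -- later EXECUTEs in this session can reuse the plan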

If you happen to think there's a genuine case for having the query
planner do a better job of doing LIMIT pushdowns into subqueries, then
you're welcome to submit a patch to implement that. You'll also need
to carefully document exactly in which cases the LIMIT can be pushed
down and in which it cannot. That's the hard part. The actual pushing down of
the clause is dead easy. If you're going to do that, then I'd suggest
you come up with better examples than this one. I don't think many
people will get on board with your newly proposed optimisations when
the example queries are obviously not real. It's hard to see the
optimisation being useful when the query that's meant to motivate it
is so obviously not a real one.

David

[1] https://www.postgresql.org/message-id/BN7PR07MB52024B973EAB075F4DF6C19ACD999%40BN7PR07MB5202.namprd07.prod.outlook.com
