Re: pgsql: Avoid mislabeling of lateral references when pulling up a subque

From: Andrei Lepikhov <lepihov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: Re: pgsql: Avoid mislabeling of lateral references when pulling up a subque
Date: 2024-12-03 09:15:20
Message-ID: 7d8175ca-82bb-4878-8b25-7edb124fac21@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers

On 11/29/24 05:33, Tom Lane wrote:
> Avoid mislabeling of lateral references when pulling up a subquery.
>
> If we are pulling up a subquery that's under an outer join, and
> the subquery's target list contains a strict expression that uses
> both a subquery variable and a lateral-reference variable, it's okay
> to pull up the expression without wrapping it in a PlaceHolderVar.
> That's safe because if the subquery variable is forced to NULL
> by the outer join, the expression result will come out as NULL too,
> so we don't have to force that outcome by evaluating the expression
> below the outer join. It'd be correct to wrap in a PHV, but that can
> lead to very significantly worse plans, since we'd then have to use
> a nestloop plan to pass down the lateral reference to where the
> expression will be evaluated.
Pardon the noise, but I'm curious why the optimiser must choose NestLoop
in the case of lateral reference.

It would be nice to provide alternatives. Because now we have some
corner cases. For example, with pull-up correlated subqueries, we've got
one degraded case. Look the following:

DROP TABLE IF EXISTS t1,t2;
CREATE TABLE t1(x int, y int);
CREATE TABLE t2(x int, y int);
INSERT INTO t1 (x,y)
SELECT gs,-gs FROM generate_series(1,1E4) AS gs;
ANALYZE t1,t2;

EXPLAIN (ANALYZE, COSTS ON)
SELECT t1.* FROM t1 LEFT JOIN LATERAL (
SELECT t3.* FROM t1 t3 WHERE t3.x=t1.x) AS t4
ON (t4.y IN (SELECT y FROM t2 WHERE t4.x=t2.x));

In previous versions Postgres executed this plan in milliseconds:

Hash Left Join
Hash Cond: (t1.x = t3.x)
-> Seq Scan on t1
-> Hash
-> Seq Scan on t1 t3
Filter: (ANY (y = (SubPlan 1).col1))
SubPlan 1
-> Seq Scan on t2
Filter: (t3.x = x)
Planning Time: 0.175 ms
Execution Time: 6.396 ms

But now we have seconds:

Nested Loop Left Join
-> Seq Scan on t1
-> Nested Loop Semi Join
Join Filter: ((t3.x = t2.x) AND (t3.y = t2.y))
-> Seq Scan on t1 t3
Filter: (x = t1.x)
-> Seq Scan on t2
Planning Time: 1.309 ms
Execution Time: 6780.217 ms

Correlated subquery pull-up is a nice optimisation, of course. So, why
not let optimiser try a HashJoin like that (not a really generated
plan, just my imagination):

Hash Left Join
Hash Cond: (t1.x = t3.x)
-> Seq Scan on t1
Hash
-> Nested Loop Semi Join
Join Filter: ((t3.x = t2.x) AND (t3.y = t2.y))
-> Seq Scan on t1 t3
-> Seq Scan on t2

Does the optimiser have some internal limits to let such a path?

--
regards, Andrei Lepikhov

In response to

Responses

Browse pgsql-committers by date

  From Date Subject
Next Message Peter Eisentraut 2024-12-03 13:36:26 pgsql: Fix handling of CREATE DOMAIN with GENERATED constraint syntax
Previous Message Peter Eisentraut 2024-12-03 08:14:34 pgsql: Fix temporary memory leak in system table index scans