Re: Do not scan index in right table if condition for left join evaluates to false using columns in left table

From: Andrei Lepikhov <lepihov(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Илья Жарков <izharkov1243(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org, p(dot)petrov(at)postgrespro(dot)ru, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Do not scan index in right table if condition for left join evaluates to false using columns in left table
Date: 2024-12-08 02:23:42
Message-ID: 07d035c5-0e40-45ee-9c78-60a5dc6e9fae@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 12/8/24 06:13, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
>> On 2024-12-07 17:06:52 -0500, Tom Lane wrote:
>>> One could imagine that we split up the join filter conditions into
>>> "depends on RHS" and "doesn't depend on RHS" subsets, and make the
>>> nestloop plan node evaluate the latter set only once per LHS row,
>>> and then skip the inner-side scan when that condition fails.
>
>> As I wrote in my other email, I'm also somewhat dubious it's worth having
>> explicit code for this in nodeNestloop.c.
>
> Yeah. Your idea of pushing the "doesn't depend on RHS" subset into
> a one-time Result filter atop the RHS is interesting though. Then we
> don't need any new executor machinery, but we pay for that with a more
> complex planner patch. Not sure how hard that would be.
Well, I have been working on this topic for some time. Let me share some
experiences and thoughts. They may be helpful.
I had a user request on this feature, a kind of 'semi-pushdown' clause.
As production people said, it is quite typical in sales systems to have
a table of products and additional tables describing product categories.
Something like that:

CREATE TABLE products (
id int PRIMARY KEY, payload text DEFAULT 'product',
price real DEFAULT random(), type text);
CREATE TABLE vehicles (id int PRIMARY KEY, maxspeed int DEFAULT 250);
CREATE TABLE phones (id int PRIMARY KEY, screensize real DEFAULT 6.1);

INSERT INTO products (id,type)
SELECT x,'v' FROM generate_series(1,5E4) AS x;
INSERT INTO products (id,type)
SELECT x,'p' FROM generate_series(1+5E4,1E5) AS x;
INSERT INTO vehicles (id) SELECT x FROM generate_series(1,5E4) AS x;
INSERT INTO phones (id) SELECT x FROM generate_series(1+5E4,1E5) AS x;
VACUUM ANALYZE products, vehicles, phones;

They usually need to get top-sales products from different categories
querying database with something like that:

EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT * FROM products p
LEFT JOIN phones ph ON (p.id = ph.id)
LEFT JOIN vehicles v ON (p.id = v.id)
WHERE p.price > 0.9;

Users just want to optimise such queries and get description of
different products within a single query. The query they wish to execute
looks like this:

EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
SELECT * FROM products p
LEFT JOIN phones ph ON (p.id = ph.id AND p.type = 'p')
LEFT JOIN vehicles v ON (p.id = v.id AND p.type = 'v')
WHERE p.price > 0.9;

The request was: why not check the RHS-only clause inside a join node
and save cycles?

To determine how difficult it could be, I wrote a prototype (very raw),
which shows how it works in action and how many subsystems we have to touch.
Also, I wonder why you think it could work with NestLoop only. I think
avoiding touching a hash table and an index under MergeJoin can also be
beneficial.

The patch and reproduction script with resulting EXPLAINs are in the
attachment. Petr Petrov is doing further work. He may provide additional
details, if any.

--
regards, Andrei Lepikhov

Attachment Content-Type Size
reproduction.sql application/sql 2.1 KB
v2-lazy-join.diff text/x-patch 6.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andres Freund 2024-12-08 02:52:20 Re: Do not scan index in right table if condition for left join evaluates to false using columns in left table
Previous Message Tomas Vondra 2024-12-08 00:36:38 Re: [PATCH] Add roman support for to_number function