Quick Links

Re: Do not scan index in right table if condition for left join evaluates to false using columns in left table

From:	eng eng <pspetrov91(at)gmail(dot)com>
To:	Andrei Lepikhov <lepihov(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>
Cc:	Илья Жарков <izharkov1243(at)gmail(dot)com>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: Do not scan index in right table if condition for left join evaluates to false using columns in left table
Date:	2024-12-16 09:58:52
Message-ID:	CAJ-KsAw4EusG+GJCkUnx2SLUvQ0GaydFJzX2EezOYYt3HJpBtA@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

пн, 16 дек. 2024 г. в 12:52, Andrei Lepikhov <lepihov(at)gmail(dot)com>:

> On 12/8/24 06:13, Tom Lane wrote:
> > Andres Freund <andres(at)anarazel(dot)de> writes:
> >> On 2024-12-07 17:06:52 -0500, Tom Lane wrote:
> >>> One could imagine that we split up the join filter conditions into
> >>> "depends on RHS" and "doesn't depend on RHS" subsets, and make the
> >>> nestloop plan node evaluate the latter set only once per LHS row,
> >>> and then skip the inner-side scan when that condition fails.
> >
> >> As I wrote in my other email, I'm also somewhat dubious it's worth
> having
> >> explicit code for this in nodeNestloop.c.
> >
> > Yeah. Your idea of pushing the "doesn't depend on RHS" subset into
> > a one-time Result filter atop the RHS is interesting though. Then we
> > don't need any new executor machinery, but we pay for that with a more
> > complex planner patch. Not sure how hard that would be.
> Well, I have been working on this topic for some time. Let me share some
> experiences and thoughts. They may be helpful.
> I had a user request on this feature, a kind of 'semi-pushdown' clause.
> As production people said, it is quite typical in sales systems to have
> a table of products and additional tables describing product categories.
> Something like that:
>
> CREATE TABLE products (
> id int PRIMARY KEY, payload text DEFAULT 'product',
> price real DEFAULT random(), type text);
> CREATE TABLE vehicles (id int PRIMARY KEY, maxspeed int DEFAULT 250);
> CREATE TABLE phones (id int PRIMARY KEY, screensize real DEFAULT 6.1);
>
> INSERT INTO products (id,type)
> SELECT x,'v' FROM generate_series(1,5E4) AS x;
> INSERT INTO products (id,type)
> SELECT x,'p' FROM generate_series(1+5E4,1E5) AS x;
> INSERT INTO vehicles (id) SELECT x FROM generate_series(1,5E4) AS x;
> INSERT INTO phones (id) SELECT x FROM generate_series(1+5E4,1E5) AS x;
> VACUUM ANALYZE products, vehicles, phones;
>
> They usually need to get top-sales products from different categories
> querying database with something like that:
>
> EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
> SELECT * FROM products p
> LEFT JOIN phones ph ON (p.id = ph.id)
> LEFT JOIN vehicles v ON (p.id = v.id)
> WHERE p.price > 0.9;
>
> Users just want to optimise such queries and get description of
> different products within a single query. The query they wish to execute
> looks like this:
>
> EXPLAIN (ANALYZE, COSTS OFF, TIMING OFF)
> SELECT * FROM products p
> LEFT JOIN phones ph ON (p.id = ph.id AND p.type = 'p')
> LEFT JOIN vehicles v ON (p.id = v.id AND p.type = 'v')
> WHERE p.price > 0.9;
>
> The request was: why not check the RHS-only clause inside a join node
> and save cycles?
>
> To determine how difficult it could be, I wrote a prototype (very raw),
> which shows how it works in action and how many subsystems we have to
> touch.
> Also, I wonder why you think it could work with NestLoop only. I think
> avoiding touching a hash table and an index under MergeJoin can also be
> beneficial.
>
> The patch and reproduction script with resulting EXPLAINs are in the
> attachment. Petr Petrov is doing further work. He may provide additional
> details, if any.
>
>
> --
> regards, Andrei Lepikhov

Hello!

I would like to describe the main idea:

1) We would like to speed up Nested Loop Left Join execution since getting
the next inner tuple could be expensive.

Also, not all outer tuples should be matched. For example, we have 200k
rows and inner tuples are extracted only for 100 of them.

That could allow us to execute Nested Loop Left Join faster.

2) When we analyze RestrictInfo's clauses it is possible to detect whether
it's sufficient to use outer tuple data to calculate them.

If it is the case, then we save them in rhs_joinrinfo. In the current
patch version that was done in create_nestloop_path().

Then in make_nestloop() rhs_joinqual clauses were removed from
joinqual.

Then execute rhs_joinqual in ExecNestLoop() before the next inner tuple
extraction.

If the result of their evaluation is false then we fill inner tuple's
fields with nulls and return the result without visiting the inner
relation.

3) I also propose to add a new type of filter: Outer Tuple Filter.

It's a filter which is evaluated after getting the outer tuple but
before extracting the inner tuple.

Join filter is evaluated after grabbing the inner tuple, that's why I
suggest to differentiate them.

To calculate unmatched tuples by Outer Tuple Filter an additional
counter was added to Instrumentation struct: int nunmatched.

You can find the execution plan in reproduction.sql

The patch was based on commit 3191eccd8a9bff1715f2e4fab86d2932a556185e

At least, make check-world has passed.

That's my first experience in making patches, so apologize if I
misunderstand or explain something poorly.

Looking forward to your questions and feedback.

---
Best regards,
Peter Petrov

Attachment	Content-Type	Size
reproduction.sql	application/sql	2.4 KB
v3-lazy-join.diff	text/x-patch	27.4 KB

In response to

Re: Do not scan index in right table if condition for left join evaluates to false using columns in left table at 2024-12-08 02:23:42 from Andrei Lepikhov

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Heikki Linnakangas	2024-12-16 10:06:33	A few patches to clarify snapshot management
Previous Message	jian he	2024-12-16 09:25:45	Re: Pass ParseState as down to utility functions.