Re: Removing unneeded self joins

From: Alena Rybakina <a(dot)rybakina(at)postgrespro(dot)ru>
To: Alexander Korotkov <aekorotkov(at)gmail(dot)com>, Andrei Lepikhov <lepihov(at)gmail(dot)com>
Cc: Richard Guo <guofenglinux(at)gmail(dot)com>, Ranier Vilela <ranier(dot)vf(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, "Gregory Stark (as CFM)" <stark(dot)cfm(at)gmail(dot)com>, Michał Kłeczek <michal(at)kleczek(dot)org>, Dean Rasheed <dean(dot)a(dot)rasheed(at)gmail(dot)com>
Subject: Re: Removing unneeded self joins
Date: 2025-04-05 21:01:57
Message-ID: 5309f032-e2fc-4913-a5d6-a2a2b5e22b29@postgrespro.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 05.04.2025 13:09, Alexander Korotkov wrote:
> On Fri, Apr 4, 2025 at 11:35 AM Andrei Lepikhov<lepihov(at)gmail(dot)com> wrote:
>> On 4/4/25 04:53, Richard Guo wrote:
>>> On Fri, Apr 4, 2025 at 1:02 AM Alexander Korotkov<aekorotkov(at)gmail(dot)com> wrote:
>>>> I've got an off-list bug report from Alexander Lakhin involving a
>>>> placeholder variable. Alena and Andrei proposed a fix. It is fairly
>>>> simple: we just shouldn't remove PHVs during self-join elimination, as
>>>> they might still be referenced from other parts of a query. The patch
>>>> is attached. I'm going to fix this if no objections.
>>> Hmm, I'm not sure about the fix. It seems to me that it simply
>>> prevents removing any PHVs in the self-join removal case. My concern
>>> is that this might result in PHVs that could actually be removed not
>>> being removed in many cases.
>> Let's play with use cases:
>> If a PHV is needed in the inner or outer only, it means we have a clause
>> in the baserestrictinfo that will be transferred to the keeping
>> relation, and we shouldn't remove the PHV.
>> Another case is when the PHV is needed in a join clause of the
>> self-join. I may imagine such a case:
>>
>> toKeep.x+toRemove.y=PHV
>>
>> This clause will be transformed to "toKeep.x+toKeep.y=PHV", pushed to
>> baserestrictinfo of keeping relation and should be saved.
>> I think it is possible to invent quite a narrow case of clause like the
>> following:
>>
>> PHV_evaluated_at_inner = PHV_evaluated_at_outer
>>
>> It needs to prove reproducibility. But even if it makes sense, it seems
>> to have no danger for further selectivity estimation compared to the
>> source clause and is a too-narrow case, isn't it?
>> In other cases, this PHV is needed something else, and we can't remove it.
I agree with that, and I also believe this case will be quite rare in
practice.
> Should we add more regression tests covering these cases?
I experimented with some examples like this and noticed that it does
affect cardinality estimation, though I'm not sure the impact is
significant.
I used the tables from the regression tests, so if they’re appropriate
for reproducing this case, it should be straightforward to add them.

EXPLAIN (analyze, VERBOSE)
SELECT 1
FROM tbl_phv t1
LEFT JOIN (
    SELECT 1 AS extra, x, y FROM tbl_phv tl
) t3
JOIN (
    SELECT y, x FROM tbl_phv tr
) t4 ON t4.y = t3.y and (t4.x*0) = (t3.x *0)
ON true
WHERE t3.extra IS NOT NULL
  AND t4.x + t3.y= (t1.x * 0);

The query plan with transformation:
------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=3.27..7.16 rows=100 width=4) (actual
time=0.149..0.151 rows=0.00 loops=1)
   Output: 1
   Hash Cond: ((tr.x + tr.y) = (t1.x * 0))
   Buffers: shared hit=2
   ->  Seq Scan on public.tbl_phv tr  (cost=0.00..2.26 rows=100
width=8) (actual time=0.027..0.047 rows=100.00 loops=1)
         Output: tr.x, tr.y
         Filter: ((1 IS NOT NULL) AND ((tr.x * 0) IS NOT NULL))
         Rows Removed by Filter: 1
         Buffers: shared hit=1
   ->  Hash  (cost=2.01..2.01 rows=101 width=4) (actual
time=0.078..0.079 rows=100.00 loops=1)
         Output: t1.x
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         Buffers: shared hit=1
         ->  Seq Scan on public.tbl_phv t1  (cost=0.00..2.01 rows=101
width=4) (actual time=0.005..0.038 rows=101.00 loops=1)
               Output: t1.x
               Buffers: shared hit=1
 Planning Time: 0.607 ms
 Execution Time: 0.194 ms
(18 rows)

The query plan without transformation:

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=6.09..8.74 rows=1 width=4) (actual time=0.248..0.250
rows=0.00 loops=1)
Output:1
Hash Cond:((t1.x * 0) = (tr.x + tl.y))
Buffers:shared hit=3
-> Seq Scan on public.tbl_phv t1 (cost=0.00..2.01 rows=101 width=4)
(actual time=0.011..0.021 rows=101.00 loops=1)
Output:t1.x, t1.y
Buffers:shared hit=1
-> Hash (cost=6.08..6.08 rows=1 width=8) (actual time=0.206..0.207
rows=100.00 loops=1)
Output:tl.y, tr.x
Buckets:1024 Batches:1 Memory Usage:12kB
Buffers:shared hit=2
-> Hash Join (cost=3.51..6.08 rows=1 width=8) (actual time=0.090..0.160
rows=100.00 loops=1)
Output:tl.y, tr.x
Inner Unique:true
Hash Cond:((tr.y = tl.y) AND ((tr.x * 0) = (tl.x * 0)))
Buffers:shared hit=2
-> Seq Scan on public.tbl_phv tr (cost=0.00..2.01 rows=101 width=8)
(actual time=0.005..0.016 rows=101.00 loops=1)
Output:tr.x, tr.y
Buffers:shared hit=1
-> Hash (cost=2.01..2.01 rows=100 width=8) (actual time=0.080..0.080
rows=100.00 loops=1)
Output:tl.y, tl.x
Buckets:1024 Batches:1 Memory Usage:12kB
Buffers:shared hit=1
-> Seq Scan on public.tbl_phv tl (cost=0.00..2.01 rows=100 width=8)
(actual time=0.008..0.035 rows=101.00 loops=1)
Output:tl.y, tl.x
Filter:(1 IS NOT NULL)
Buffers:shared hit=1
Planning:
Buffers:shared hit=25
Planning Time:0.609 ms
Execution Time:0.319 ms
(31 rows)

EXPLAIN (analyze, VERBOSE)
SELECT 1
FROM tbl_phv t1
LEFT JOIN (
    SELECT 1 AS extra, x, y FROM tbl_phv tl
) t3
JOIN (
    SELECT y, x FROM tbl_phv tr
) t4 ON t4.y = t3.y
ON true
WHERE t3.extra IS NOT NULL
  AND t4.x + t3.y= (t1.x * 0);

The query plan with transformation:
------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=3.26..6.90 rows=100 width=4) (actual
time=0.145..0.146 rows=0.00 loops=1)
   Output: 1
   Hash Cond: ((t1.x * 0) = (tr.x + tr.y))
   Buffers: shared hit=2
   ->  Seq Scan on public.tbl_phv t1  (cost=0.00..2.01 rows=101
width=4) (actual time=0.030..0.040 rows=101.00 loops=1)
         Output: t1.x, t1.y
         Buffers: shared hit=1
   ->  Hash  (cost=2.01..2.01 rows=100 width=8) (actual
time=0.081..0.081 rows=100.00 loops=1)
         Output: tr.y, tr.x
         Buckets: 1024  Batches: 1  Memory Usage: 12kB
         Buffers: shared hit=1
         ->  Seq Scan on public.tbl_phv tr  (cost=0.00..2.01 rows=100
width=8) (actual time=0.012..0.040 rows=101.00 loops=1)
               Output: tr.y, tr.x
               Filter: (1 IS NOT NULL)
               Buffers: shared hit=1
 Planning Time: 0.571 ms
 Execution Time: 0.195 ms
(17 rows)

The query plan without transformation:

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=6.53..9.94 rows=50 width=4) (actual time=0.236..0.238
rows=0.00 loops=1)
Output:1
Hash Cond:((tr.x + tl.y) = (t1.x * 0))
Buffers:shared hit=3
-> Hash Join (cost=3.26..5.55 rows=100 width=8) (actual
time=0.075..0.137 rows=101.00 loops=1)
Output:tl.y, tr.x
Inner Unique:true
Hash Cond:(tr.y = tl.y)
Buffers:shared hit=2
-> Seq Scan on public.tbl_phv tr (cost=0.00..2.01 rows=101 width=8)
(actual time=0.005..0.017 rows=101.00 loops=1)
Output:tr.x, tr.y
Buffers:shared hit=1
-> Hash (cost=2.01..2.01 rows=100 width=4) (actual time=0.064..0.065
rows=101.00 loops=1)
Output:tl.y
Buckets:1024 Batches:1 Memory Usage:12kB
Buffers:shared hit=1
-> Seq Scan on public.tbl_phv tl (cost=0.00..2.01 rows=100 width=4)
(actual time=0.006..0.033 rows=101.00 loops=1)
Output:tl.y
Filter:(1 IS NOT NULL)
Buffers:shared hit=1
-> Hash (cost=2.01..2.01 rows=101 width=4) (actual time=0.078..0.078
rows=100.00 loops=1)
Output:t1.x
Buckets:1024 Batches:1 Memory Usage:12kB
Buffers:shared hit=1
-> Seq Scan on public.tbl_phv t1 (cost=0.00..2.01 rows=101 width=4)
(actual time=0.014..0.039 rows=101.00 loops=1)
Output:t1.x
Buffers:shared hit=1
Planning:
Buffers:shared hit=12
Planning Time:0.478 ms
Execution Time:0.293 ms
(31 rows)

--
Regards,
Alena Rybakina
Postgres Professional

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andreas Karlsson 2025-04-05 22:14:52 Re: dblink query interruptibility
Previous Message Shayon Mukherjee 2025-04-05 21:00:59 Re: [PATCH] Re: Proposal to Enable/Disable Index using ALTER INDEX