BUG #18152: Join condition is not pushed down to union all subquery

From: PG Bug reporting form <noreply(at)postgresql(dot)org>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc: lauri(dot)kajan(at)gmail(dot)com
Subject: BUG #18152: Join condition is not pushed down to union all subquery
Date: 2023-10-10 06:26:26
Message-ID: 18152-72d0fe42ca4283d9@postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 18152
Logged by: Lauri Kajan
Email address: lauri(dot)kajan(at)gmail(dot)com
PostgreSQL version: 16.0
Operating system: &quot;PostgreSQL 16.0 (Debian 16.0-1.pgdg110+1) on x86_
Description:

The join condition is not pushed down to a subquery containing UNION ALL
when even a single query within the subquery contains a WHERE clause. This
issue prevents the optimizer from creating a plan that utilizes the index on
the join column. However, when the WHERE clause is moved to the top-level
query, the optimizer produces the desired plan.

EXPLAIN
WITH
targets AS (
SELECT 'bike' AS vehicle, id, dealer_name, frame_size FROM bikes WHERE
frame_size = 52
UNION ALL
SELECT 'car' AS vehicle, id, dealer_name, null::int AS frame_size FROM
cars
)
SELECT
dealers.name dealer,
targets.vehicle,
targets.id
FROM
dealers
JOIN targets
ON dealers.name = targets.dealer_name
WHERE
dealers.id IN (54,12,456)
;
---
Hash Join (cost=16.91..4587.77 rows=325 width=41) (actual
time=0.270..27.133 rows=275 loops=1)
Hash Cond: (""*SELECT* 1"".dealer_name = dealers.name)
-> Append (cost=0.00..4026.20 rows=108280 width=41) (actual
time=0.009..20.376 rows=108222 loops=1)
-> Subquery Scan on ""*SELECT* 1"" (cost=0.00..1908.80 rows=8280
width=41) (actual time=0.009..6.708 rows=8222 loops=1)
-> Seq Scan on bikes (cost=0.00..1826.00 rows=8280 width=45)
(actual time=0.009..6.114 rows=8222 loops=1)
Filter: (frame_size = 52)
Rows Removed by Filter: 91778
-> Seq Scan on cars (cost=0.00..1576.00 rows=100000 width=41)
(actual time=0.012..8.433 rows=100000 loops=1)
-> Hash (cost=16.88..16.88 rows=3 width=5) (actual time=0.017..0.018
rows=3 loops=1)
Buckets: 1024 Batches: 1 Memory Usage: 9kB
-> Index Scan using dealers_pkey on dealers (cost=0.28..16.88
rows=3 width=5) (actual time=0.009..0.015 rows=3 loops=1)
Index Cond: (id = ANY ('{54,12,456}'::integer[]))
Planning Time: 0.148 ms
Execution Time: 27.170 ms

Moving the WHERE clause to the top-level query produces a more efficient
query execution plan. Example:

EXPLAIN ANALYZE
WITH
targets AS (
SELECT 'bike' AS vehicle, id, dealer_name, frame_size FROM bikes
UNION ALL
SELECT 'car' AS vehicle, id, dealer_name, null::int AS frame_size FROM
cars
)
SELECT
dealers.name dealer,
targets.vehicle,
targets.id
FROM
dealers
JOIN targets
ON dealers.name = targets.dealer_name
WHERE
dealers.id IN (54,12,456)
AND (frame_size IS null OR frame_size = 52)
;
----
Nested Loop (cost=5.32..1443.67 rows=325 width=41) (actual
time=0.106..0.815 rows=275 loops=1)
-> Index Scan using dealers_pkey on dealers (cost=0.28..16.88 rows=3
width=5) (actual time=0.019..0.031 rows=3 loops=1)
Index Cond: (id = ANY ('{54,12,456}'::integer[]))
-> Append (cost=5.04..474.52 rows=108 width=41) (actual
time=0.048..0.250 rows=92 loops=3)
-> Bitmap Heap Scan on bikes (cost=5.04..237.10 rows=8 width=41)
(actual time=0.045..0.123 rows=9 loops=3)
Recheck Cond: (dealer_name = dealers.name)
Filter: ((frame_size IS NULL) OR (frame_size = 52))
Rows Removed by Filter: 91
Heap Blocks: exact=274
-> Bitmap Index Scan on bikes_dealer_name_idx
(cost=0.00..5.04 rows=100 width=0) (actual time=0.010..0.010 rows=100
loops=3)
Index Cond: (dealer_name = dealers.name)
-> Bitmap Heap Scan on cars (cost=5.07..236.87 rows=100 width=41)
(actual time=0.027..0.116 rows=82 loops=3)
Recheck Cond: (dealer_name = dealers.name)
Heap Blocks: exact=233
-> Bitmap Index Scan on cars_dealer_name_idx
(cost=0.00..5.04 rows=100 width=0) (actual time=0.017..0.017 rows=82
loops=3)
Index Cond: (dealer_name = dealers.name)
Planning Time: 0.238 ms
Execution Time: 0.863 ms

Unfortunately this workaround may become complex if there are multiple
filters for multiple tables, as each filtered column must be added to every
unioned query.

Here's a script to generate sample data for the example case:

CREATE TABLE dealers AS
SELECT
id,
(SELECT string_agg(CHR(65+(random() * 25)::integer), '') FROM
generate_series(1, 4) WHERE id>0) name
FROM generate_series(1, 1000) AS id
;
ALTER TABLE dealers ADD primary key (id);
CREATE INDEX ON dealers(name);

CREATE TABLE bikes AS
SELECT
generate_series AS id,
(SELECT name FROM dealers WHERE dealers.id = (SELECT (random()*1000)::int
WHERE generate_series>0)) AS dealer_name,
(random()*12+50)::int as frame_size
FROM generate_series(1, 100000);
ALTER TABLE bikes ADD primary key (id);
CREATE INDEX ON bikes(dealer_name);

CREATE TABLE cars AS
SELECT
generate_series as id,
(SELECT name FROM dealers WHERE dealers.id = (SELECT (random()*1000)::int
WHERE generate_series>0)) AS dealer_name,
(random()*7+14)::int as wheel_size
FROM generate_series(1, 100000);
ALTER TABLE cars ADD primary key (id);
CREATE INDEX ON cars(dealer_name);

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Zhaoxun Yan 2023-10-10 09:24:02 Re: pg_rewind: ERROR: could not fetch remote file "global/pg_control": ERROR: permission denied
Previous Message Andres Freund 2023-10-09 23:55:30 Re: Access to old versions of a row