Partitionwise JOIN scanning all partitions, even unneeded ones

From: Dimitrios Apostolou <jimis(at)gmx(dot)net>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Partitionwise JOIN scanning all partitions, even unneeded ones
Date: 2023-08-30 19:55:01
Message-ID: 704abd8e-5b3e-f946-fb7c-58d39eb1268e@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello list,

I believe I'm facing a performance bug in PostgreSQL with partitionwise
JOINs. I have reduced the issue to the minimum queries I could, please
read-on for details and see attached files for EXPLAIN output with timings
and database settings.

I'd appreciate help on whether it's a real issue, and if it's unknown I
can forward this to the psql-bugs mailing list. I'd also appreciate any
critique on the clarity of my description and on my schema and queries,
since I'm new to postgres.

==== Short description

+ Each textual "task_id" is described by a unique number "task_n".
+ Each task can have many workitems, and this relation is stored in
"tasks_mm_workitems". A workitem has a unique number "workitem_n".
+ I have a massive (10G rows) table called test_runs_raw. It contains
a list of test runs for each workitem_n.
+ TABLE test_runs_raw is PARTITIONED into 1000
partitions (many of them empty, reserved for future)
according to RANGE(workitem_n).

SUMMARY: I want to do stuff with all test_runs of a task.
Basically I want JOIN tables in order to go from one "task_id" to
one "task_n" to many "workitem_n" to manymany test runs.

PROBLEM: If I JOIN the 3 tables to do that, postgres scans ALL partitions
of the massive table test_runs_raw, and this takes hours.

On the other hand, if I omit the 1st part of the JOIN and provide
the "task_n" manually, and JOIN only the 2 latter tables,
postgres only scans the relevant partitions and my queries
return fast.

VERSION:
PostgreSQL 15.4 (Ubuntu 15.4-1.pgdg22.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0, 64-bit

==== Queries

Here are some seemingly meaningless queries, that are just minimal
reproducers of the issue (you can ignore "DISTINCT" part, it's just there
to limit the output I was getting).

===== Slow query:

EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS) SELECT DISTINCT workitem_n
FROM task_ids
JOIN tasks_mm_workitems USING(task_n)
JOIN test_runs_raw USING(workitem_n)
WHERE task_id = '1698813977';

The EXPLAIN output here shows a parallel hash join doing seq scans on each
and every partition. Basically the whole 10G rows table is being
seq-scanned.

===== Fast query:

Here I resolve the "task_n" from the "task_id" manually, and omit the
first part of the JOIN.

SELECT task_n FROM task_ids WHERE task_id = '1698813977';
task_n
--------
59854
(1 row)

EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS) SELECT DISTINCT workitem_n
FROM tasks_mm_workitems
JOIN test_runs_raw USING(workitem_n)
WHERE task_n = 59854;

The EXPLAIN output here shows "(never executed)" on most partition seq
scans. The relevant partitions are index-scanned. Basically only a few
partitions are accessed, and only partially through their index on
"workitem_n". This is excellent and what I would expect in all cases.

==== Schema description

> \d+ task_ids
Column | Type | Collation | Nullable | Default
---------+---------+-----------+----------+------------------------------------------
task_n | integer | | not null |
nextval('task_ids_task_n_seq'::regclass)
task_id | text | | not null |
Indexes:
"task_ids_pkey" PRIMARY KEY, btree (task_n)
"task_ids_task_id_key" UNIQUE CONSTRAINT, btree (task_id)

> \d tasks_mm_workitems
Table "public.tasks_mm_workitems"
Column | Type | Collation | Nullable | Default
------------+---------+-----------+----------+---------
task_n | integer | | not null |
workitem_n | integer | | not null |
Indexes:
"tasks_mm_workitems_pkey" PRIMARY KEY, btree (task_n, workitem_n)
Foreign-key constraints:
"fk1_tasks_mm_workitems" FOREIGN KEY (task_n) REFERENCES
task_ids(task_n)
"fk2_tasks_mm_workitems" FOREIGN KEY (workitem_n) REFERENCES
workitem_ids(workitem_n)

> \d test_runs_raw
Partitioned table "public.test_runs_raw"
Column | Type | Collation | Nullable |
Default
-------------------+-----------------------------+-----------+----------+----------------------------------
run_n | bigint | | not null |
generated by default as identity
[...]
workitem_n | integer | | not null |
[...]
Partition key: RANGE (workitem_n)
Indexes:
"test_runs_raw_partitioned_pkey" PRIMARY KEY, btree (workitem_n,
run_n), tablespace "archival_tablespace_1"
Foreign-key constraints:
[...]
"test_runs_raw_partitioned_workitem_n_fkey" FOREIGN KEY (workitem_n)
REFERENCES workitem_ids(workitem_n)
Number of partitions: 1000 (Use \d+ to list them.)
Tablespace: "tablespace1"

Thank you in advance,
Dimitris

Attachment Content-Type Size
partitionwise_join-fast.explain.log text/plain 369.6 KB
partitionwise_join-slow.explain.log text/plain 348.9 KB

Responses

Browse pgsql-general by date

  From Date Subject
Next Message David Rowley 2023-08-30 20:56:41 Re: Partitionwise JOIN scanning all partitions, even unneeded ones
Previous Message David G. Johnston 2023-08-30 18:44:49 Re: event trigger clarification