Slow performance with trivial self-joins

From: Benny Kramek <benny(at)medflyt(dot)com>
To: pgsql-performance(at)lists(dot)postgresql(dot)org
Subject: Slow performance with trivial self-joins
Date: 2020-02-03 20:38:15
Message-ID: CAGPCyEZWEVjkDtuHipOaPKbO2WXdLUuNBhj818-wEuhi4KLbzA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hello,

I am experiencing slow performance when joining a table against itself on its
primary key column.

I expect the query plan to be identical for both of the below queries (and I
expect the performance to also be identical). But the second one is much slower:

The FAST QUERY has a planning time of 0.110 ms and execution time of 3.836 ms
The SLOW QUERY has a planning time of 0.296 ms and execution time of 22.969 ms

The reason I believe that they should be the same is because the postgres query
planner should notice that I am joining a table against itself on its primary
key column (which is not null + unique) and therefore it should realize that it
doesn't actually have to do any additional work and can simply directly access
the existing columns.

I've tested this on PostgreSQL 10, 11, 12, 12.1 and 13devel (source snapshot
from 2020-02-03, git commit f1f10a1ba9e17e606a7b217ccccdd3cc4d8cb771)

Here is a full example session:

---------
-- SETUP
---------

CREATE TABLE test_data (
id SERIAL4 PRIMARY KEY,
value TEXT
);

INSERT INTO test_data (value)
SELECT value FROM (
SELECT
generate_series(1, 100000) AS id,
md5(random()::TEXT) AS value
) q;

--------------
-- FAST QUERY
--------------

EXPLAIN ANALYZE SELECT
test_data.id,
md5(test_data.value) AS x,
md5(md5(md5(md5(md5(test_data.value))))) AS y
FROM
test_data
WHERE TRUE
AND test_data.id BETWEEN 3000 AND 4000;

--------------
-- SLOW QUERY
--------------

EXPLAIN ANALYZE SELECT
test_data.id,
md5(test_data.value) AS x,
md5(md5(md5(md5(md5(t2.value))))) AS y
FROM
test_data,
test_data AS t2
WHERE TRUE
AND t2.id = test_data.id
AND test_data.id BETWEEN 3000 AND 4000;

--- END ---

Here is the query plan of the FAST QUERY:

Index Scan using test_data_pkey on test_data (cost=0.29..60.17
rows=1025 width=68) (actual time=0.047..3.747 rows=1001 loops=1)
Index Cond: ((id >= 3000) AND (id <= 4000))
Planning Time: 0.110 ms
Execution Time: 3.836 ms
(4 rows)

Here is the query plan of the SLOW QUERY:

Hash Join (cost=57.60..2169.49 rows=1025 width=68) (actual
time=1.372..22.876 rows=1001 loops=1)
Hash Cond: (t2.id = test_data.id)
-> Seq Scan on test_data t2 (cost=0.00..1834.00 rows=100000
width=37) (actual time=0.010..8.800 rows=100000 loops=1)
-> Hash (cost=44.79..44.79 rows=1025 width=37) (actual
time=0.499..0.499 rows=1001 loops=1)
Buckets: 2048 Batches: 1 Memory Usage: 84kB
-> Index Scan using test_data_pkey on test_data
(cost=0.29..44.79 rows=1025 width=37) (actual time=0.023..0.287
rows=1001 loops=1)
Index Cond: ((id >= 3000) AND (id <= 4000))
Planning Time: 0.296 ms
Execution Time: 22.969 ms
(9 rows)

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Tom Lane 2020-02-03 21:10:20 Re: Slow performance with trivial self-joins
Previous Message Marco Colli 2020-02-02 18:55:59 Re: Statistics on array values