Why is a hash join preferred when it does not fit in work_mem

From: Dimitrios Apostolou <jimis(at)gmx(dot)net>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Why is a hash join preferred when it does not fit in work_mem
Date: 2023-01-12 16:24:20
Message-ID: 7ee75988-8408-36f8-3024-93dbf45b6f19@gmx.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hello list,

I have a very simple NATURAL JOIN that does not fit in the work_mem. Why
does the query planner prefer a hash join that needs 361s, while with a
sort operation and a merge join it takes only 13s?

The server is an old Mac Mini with hard disk drive and only 4GB RAM.
Postgres version info:

PostgreSQL 15.0 on x86_64-apple-darwin20.6.0, compiled by Apple clang
version 12.0.0 (clang-1200.0.32.29), 64-bit

The low work_mem and the disabled memoization are set on purpose, in order
to simplify a complex query, while reproducing the same problem that I
experienced there. This result is the simplest query I could get, where
the optimizer does not go for a faster merge join.

From my point of view a merge join is clearly faster, because the hash
table does not fit in memory and I expect a hash join to do a lot of
random I/O. But the query planner does not see that, and increasing
random_page_cost does not help either. In fact the opposite happens: the
merge join gets a higher cost difference to the hash join, as I increase
the random page cost!

# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM
tasks_mm_workitems NATURAL JOIN workitem_ids;

QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Hash Join (cost=121222.68..257633.01 rows=3702994 width=241) (actual
time=184498.464..360606.257 rows=3702994 loops=1)
Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n,
workitem_ids.workitem_id
Inner Unique: true
Hash Cond: (tasks_mm_workitems.workitem_n = workitem_ids.workitem_n)
Buffers: shared hit=15068 read=47434, temp read=56309 written=56309
-> Seq Scan on public.tasks_mm_workitems (cost=0.00..53488.94
rows=3702994 width=8) (actual time=0.040..1376.084 rows=3702994 loops=1)
Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
Buffers: shared read=16459
-> Hash (cost=59780.19..59780.19 rows=1373719 width=237) (actual
time=184361.874..184361.875 rows=1373737 loops=1)
Output: workitem_ids.workitem_id, workitem_ids.workitem_n
Buckets: 4096 Batches: 512 Memory Usage: 759kB
Buffers: shared hit=15068 read=30975, temp written=43092
-> Seq Scan on public.workitem_ids (cost=0.00..59780.19
rows=1373719 width=237) (actual time=0.026..1912.312 rows=1373737 loops=1)
Output: workitem_ids.workitem_id, workitem_ids.workitem_n
Buffers: shared hit=15068 read=30975
Settings: effective_cache_size = '500MB', enable_memoize = 'off',
hash_mem_multiplier = '1', max_parallel_workers_per_gather = '1', work_mem
= '1MB'
Planning:
Buffers: shared hit=2 read=6
Planning Time: 0.568 ms
Execution Time: 361106.876 ms
(20 rows)

# EXPLAIN (ANALYZE,VERBOSE,BUFFERS,SETTINGS) SELECT * FROM
tasks_mm_workitems NATURAL JOIN workitem_ids;

QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Merge Join (cost=609453.49..759407.78 rows=3702994 width=241) (actual time=5062.513..10866.313 rows=3702994 loops=1)
Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n, workitem_ids.workitem_id
Merge Cond: (workitem_ids.workitem_n = tasks_mm_workitems.workitem_n)
Buffers: shared hit=5343 read=66053, temp read=32621 written=32894
-> Index Scan using workitem_ids_pkey on public.workitem_ids
(cost=0.43..81815.86 rows=1373719 width=237) (actual time=0.111..1218.363 rows=1373737 loops=1)
Output: workitem_ids.workitem_n, workitem_ids.workitem_id
Buffers: shared hit=5310 read=49627
-> Materialize (cost=609372.91..627887.88 rows=3702994 width=8) (actual time=5062.389..7392.640 rows=3702994 loops=1)
Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
Buffers: shared hit=33 read=16426, temp read=32621 written=32894
-> Sort (cost=609372.91..618630.40 rows=3702994 width=8) (actual time=5062.378..6068.703 rows=3702994 loops=1)
Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
Sort Key: tasks_mm_workitems.workitem_n
Sort Method: external merge Disk: 65256kB
Buffers: shared hit=33 read=16426, temp read=32621 written=32894
-> Seq Scan on public.tasks_mm_workitems
(cost=0.00..53488.94 rows=3702994 width=8) (actual time=0.045..1177.202 rows=3702994 loops=1)
Output: tasks_mm_workitems.workitem_n, tasks_mm_workitems.task_n
Buffers: shared hit=33 read=16426
Settings: effective_cache_size = '500MB', enable_hashjoin = 'off', enable_memoize = 'off', hash_mem_multiplier = '1', max_parallel_workers_per_gather = '1', work_mem = '1MB'
Planning:
Buffers: shared hit=8
Planning Time: 0.677 ms
Execution Time: 13364.545 ms
(23 rows)

Thank you in advance,
Dimitris

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Adrian Klaver 2023-01-12 16:27:46 Re: Changing displayed time zone in RAISE NOTICE output?
Previous Message Pavel Stehule 2023-01-12 16:22:33 Re: Changing displayed time zone in RAISE NOTICE output?