From: Dimitrios Apostolou <jimis(at)gmx(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: David Rowley <dgrowleyml(at)gmail(dot)com>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Why is a hash join preferred when it does not fit in work_mem
Date: 2023-01-16 19:05:30
Message-ID: 78211cab-8f6e-52e8-8d06-26e7097c43bd@gmx.net
Lists: pgsql-general
On Sat, 14 Jan 2023, Tom Lane wrote:
> Dimitrios Apostolou <jimis(at)gmx(dot)net> writes:
>> Please correct me if I'm wrong, as I'm a newcomer to PostgreSQL, but here
>> is how I understand things according to posts I've read, and classical
>> algorithms:
>
>> + The Hash Join is fastest when one side fits in work_mem. Then on one
>> hand you have a hash table lookup (amortized O(1)) and on the other
>> hand, if the table has M rows that do not fit in memory, you have
>> sequential reads from the disk (given low fragmentation of the table or
>> index files): For every line you read from the disk, you look up the key
>> in the hash table.
>
>> If the hash table does not fit in RAM then the cost becomes prohibitive.
>> Every lookup is a random access possibly hitting the disk. The total
>> cost should be random_page_cost * M.
>
> That would be true of a simple hash join, but Postgres uses batched
> hash joins: we split up the hash key space into subsets, where hopefully
> each subset includes few enough inner-side rows to fit into work_mem.
> While this can go wrong given pathological distribution of the inner-side
> keys, it does mean that the join can perform well even when the inner
> side is much larger than work_mem. So it's not the case that the planner
> will simply disregard hash joins beyond work_mem. It will apply a cost
> penalty for the predicted batching overhead;
Thanks for this; I found a page [1] that describes the hash join, and now I
understand a bit more.
[1] https://www.interdb.jp/pg/pgsql03.html
I'm not sure whether the key distribution is pathological in my case.
The join condition is:
Hash Cond: (tasks_mm_workitems.workitem_n = workitem_ids.workitem_n)
and workitem_ids.workitem_n is an integer GENERATED AS IDENTITY and PRIMARY
KEY. The table workitem_ids has 1.7M rows, and the other table has 3.7M
rows. Neither of them fits in work_mem.
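For context, the relevant parts of the schema look roughly like this (a
sketch from memory; only workitem_n is a column name taken from the plan,
the other column names and the foreign key are my own illustration):

CREATE TABLE workitem_ids (
    workitem_n   integer GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    workitem_id  text NOT NULL             -- placeholder for the payload column
);

CREATE TABLE tasks_mm_workitems (
    task_n       integer NOT NULL,         -- placeholder name; the table has two integer columns
    workitem_n   integer NOT NULL REFERENCES workitem_ids (workitem_n)
);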
In my (simplified and pathological) case of work_mem == 1MB, the hash join
does 512 batches (Buckets: 4096 Batches: 512 Memory Usage: 759kB). I'm
not sure which hash join strategy is used, but based on that document it
should be the "hybrid hash join with skew". I don't quite follow the I/O
requirements of this algorithm yet. :-)
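For reference, the kind of experiment I'm running to watch the batching
looks roughly like this (the real query is more complex; the SELECT * and
the USING clause here are just a simplification of the join above):

SET work_mem = '1MB';                      -- session-local, the pathological setting
EXPLAIN (ANALYZE, BUFFERS)
    SELECT *
    FROM tasks_mm_workitems
    JOIN workitem_ids USING (workitem_n);
-- The Hash node then reports something like:
--   Buckets: 4096  Batches: 512  Memory Usage: 759kB
-- and raising work_mem before re-running makes the number of batches drop.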
> but that can still come out
> cheaper than merge join, because the sorting needed for merge is generally
> also far from cheap.
I was under the impression that an on-disk merge sort is relatively cheap
in terms of random I/O: it needs only a logarithmic number of merge passes,
and each pass reads and writes its run files mostly sequentially.
>
>> So I would expect an increased random_page_cost to benefit the Merge Join
>> algorithm. And since my setup involves spinning disks, it does make sense
>> to increase it.
>
> What is probably really happening is that random_page_cost affects the
> estimated cost of performing the sort using an index scan instead of
> a bespoke sort step. AFAIR, cost_sort doesn't consider random_page_cost
> at all, and neither does cost_hashjoin.
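If I got that right, then an experiment along these lines (arbitrary value,
kept local to a transaction, query again simplified) should only move the
cost of the index-scan path and leave the sort and hash join estimates alone:

BEGIN;
SET LOCAL random_page_cost = 1.1;          -- arbitrary low value, as if on SSD
EXPLAIN                                    -- estimates only, no need to execute
    SELECT *
    FROM tasks_mm_workitems
    JOIN workitem_ids USING (workitem_n);
ROLLBACK;                                  -- leave the real setting untouched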
In the last EXPLAIN I posted for the forced merge join, I see that it uses
an index scan on the "small" table, which makes sense since the join is on
that table's primary key. On the large table it does not use an index scan,
because no index exists on that column; instead it sorts the table's 3.7M
rows (and FWIW that table has only two integer columns). That is, if I
understood correctly what you meant by "performing the sort using an index
scan".
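If I understood that correctly, one thing I could try is giving the planner
an index-scan path on the big table too, something like:

CREATE INDEX CONCURRENTLY tasks_mm_workitems_workitem_n_idx   -- hypothetical name
    ON tasks_mm_workitems (workitem_n);

so that a merge join could index-scan both sides instead of sorting one of
them; I haven't tried that yet.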
The problem I see is that the estimated cost of the sort operation is
609,372.91..618,630.40. It's already way above the whole hash-join cost
(121,222.68..257,633.01). However, the real timings are very different:
actual time for the Sort is 4,602.569..5,414.072 ms, while for the whole
hash join it is 145,641.295..349,682.387 ms.
Am I missing some configuration knobs to put some sense into the planner?
Thanks,
Dimitris