Re: Adjusting hash join memory limit to handle batch explosion

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Adjusting hash join memory limit to handle batch explosion
Date: 2025-01-12 00:42:28
Message-ID: 508fda6f-5a9d-42c9-af6e-aa94ed1210ec@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 1/10/25 15:54, Melanie Plageman wrote:
> On Thu, Jan 9, 2025 at 6:59 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>
>> ...
>
>> Robert's idea kept using buffered files, but limited how many we can
>> fill at any phase. Say we'd use a limit of 1024 batches, but we actually
>> need 1M batches. Then we'd do the build in two phases - we'd generate
>> 1024 batches, and then we'd split each of those batches into 1024
>> smaller batches. The trick (as I understand it) is those batches can't
>> overlap, so we'd not need more than 1024 batches, which greatly limits
>> the memory consumption. We could even use a lower limit, derived from
>> work_mem or something like that.
>
> I think this is because we get the batch based on
>
> *batchno = pg_rotate_right32(hashvalue, hashtable->log2_nbuckets) &
> (nbatch - 1);
>
> And tuples can only spill forward. I think Robert's example is if we
> plan for 64 batches and eventually increase to 256 batches, a tuple
> assigned to batch 1 could go to 65, 129, or 193 but no other batch --
> meaning we would only need 3 files open when processing batch 1. But I
> think we would need to do more explicit file flushing and closing and
> opening, right? Which maybe doesn't matter when compared to the
> overhead of so many more buffers.
>

I had a quiet evening yesterday, so I decided to take a stab at this and
see how hard would it be, and how bad would the impact be. Attached is
an experimental patch, doing the *bare* minimum for a simple query:

1) It defines a limit of 128 batches (a bit low, but also 1MB). In
practice we'd use something like 256 - 1024, probably. Doesn't matter.

2) Ensures the initial pass over data in MultiExecPrivateHash does not
use more than 128 batches, switches to "tooManyBatches=true" if that
happens (and dumps the batch to file ExecHashDumpBatchToFile, even if
it's batchno=0). And later it calls ExecHashHandleTooManyBatches() to
increase the nbatch further.

3) Does something similar for the outer relation - if there are too many
batches, we do ExecHashJoinRepartitionBatches() which first partitions
into 128 batches. This only does a single pass in the WIP, though.
Should be recursive or something.

4) Extends the BufFile API with BufFileHasBuffer/BufFileFreeBuffer so
that the code can free the buffers. It also means the buffer needs to be
allocated separately, not embedded in BufFile struct. (I'm a bit
surprised it works without having to re-read the buffer after freeing
it, but that's probably thanks to how hashjoin uses the files).

Anyway, this seems to work, and a simple experiment looks like this:

--------------------------------------------------------------------
create table t (a int, b text);

insert into t select i, md5(i::text)
from generate_series(1,100000) s(i);

insert into t select i, md5(i::text)
from generate_series(1,100000) s(i);

insert into t select i, md5(i::text)
from generate_series(1,100000) s(i);

insert into t select i, md5(i::text)
from generate_series(1,100000) s(i);

vacuum analyze;

set work_mem='128kB';
explain analyze select * from t t1 join t t2 on (t1.a = t2.a);
--------------------------------------------------------------------

This is just enough to need 256 batches, i.e. "one doubling" over the
128 batch limit.

On master I get this:

QUERY PLAN
-----------------------------------------------------------------------
Hash Join (cost=15459.00..51555.40 rows=1638740 width=74) (actual
time=80.065..337.192 rows=1600000 loops=1)
Hash Cond: (t1.a = t2.a)
Buffers: shared hit=6668, temp read=5806 written=5806
-> Seq Scan on t t1 (cost=0.00..7334.00 rows=400000 width=37)
(actual time=0.008..19.867 rows=400000 loops=1)
Buffers: shared hit=3334
-> Hash (cost=7334.00..7334.00 rows=400000 width=37) (actual
time=76.824..76.824 rows=400000 loops=1)
Buckets: 4096 Batches: 256 Memory Usage: 132kB
Buffers: shared hit=3334, temp written=2648
-> Seq Scan on t t2 (cost=0.00..7334.00 rows=400000 width=37)
(actual time=0.001..20.855 rows=400000 loops=1)
Buffers: shared hit=3334
Planning Time: 0.100 ms
Execution Time: 385.779 ms
(12 rows)

while with the patch we get this:

QUERY PLAN
-----------------------------------------------------------------------
Hash Join (cost=15459.00..51555.40 rows=1638740 width=74) (actual
time=93.346..325.604 rows=1600000 loops=1)
Hash Cond: (t1.a = t2.a)
Buffers: shared hit=6668, temp read=8606 written=8606
-> Seq Scan on t t1 (cost=0.00..7334.00 rows=400000 width=37)
(actual time=0.006..14.416 rows=400000 loops=1)
Buffers: shared hit=3334
-> Hash (cost=7334.00..7334.00 rows=400000 width=37) (actual
time=48.481..48.482 rows=400000 loops=1)
Buckets: 4096 (originally 4096) Batches: 256 (originally 128)
Memory Usage: 132kB
Buffers: shared hit=3334, temp read=23 written=2860
-> Seq Scan on t t2 (cost=0.00..7334.00 rows=400000 width=37)
(actual time=0.001..14.754 rows=400000 loops=1)
Buffers: shared hit=3334
Planning Time: 0.061 ms
Execution Time: 374.229 ms
(12 rows)

So for this particular query there doesn't seem to be a particularly
massive hit. With more batches that's unfortunately not the case, but
that seems to be mostly due to looping over all batches when freeing
buffers. Looping over 1M batches (which we do for every batch) is
expensive, but that could be improved somehow - we only have a couple
files open (say 1024), so we could keep them in a list or something. And
I think we don't need to free the batches this often anyway, we might
not even opened any future batches.

I'm sure there's plenty of issues with the patch - e.g. it may not not
handle nbatch increases later (after batch 0), I ignored skew buckets,
and stuff like that ...

But it does seem like a workable idea ...

regards

--
Tomas Vondra

Attachment Content-Type Size
hashjoin-file-limit.patch text/x-patch 19.5 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2025-01-12 01:26:47 Re: Recovering from detoast-related catcache invalidations
Previous Message Michael Paquier 2025-01-12 00:11:14 Re: Possible integer overflow in bringetbitmap()