Re: Adjusting hash join memory limit to handle batch explosion

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Adjusting hash join memory limit to handle batch explosion
Date: 2025-01-13 17:29:05
Message-ID: e8c65bf0-0008-40a5-8467-9b829464be0a@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 1/13/25 17:32, Melanie Plageman wrote:
> On Sat, Jan 11, 2025 at 7:42 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>
>> I had a quiet evening yesterday, so I decided to take a stab at this and
>> see how hard would it be, and how bad would the impact be. Attached is
>> an experimental patch, doing the *bare* minimum for a simple query:
>>
>> 1) It defines a limit of 128 batches (a bit low, but also 1MB). In
>> practice we'd use something like 256 - 1024, probably. Doesn't matter.
>>
>> 2) Ensures the initial pass over data in MultiExecPrivateHash does not
>> use more than 128 batches, switches to "tooManyBatches=true" if that
>> happens (and dumps the batch to file ExecHashDumpBatchToFile, even if
>> it's batchno=0). And later it calls ExecHashHandleTooManyBatches() to
>> increase the nbatch further.
>>
>> 3) Does something similar for the outer relation - if there are too many
>> batches, we do ExecHashJoinRepartitionBatches() which first partitions
>> into 128 batches. This only does a single pass in the WIP, though.
>> Should be recursive or something.
>>
>> 4) Extends the BufFile API with BufFileHasBuffer/BufFileFreeBuffer so
>> that the code can free the buffers. It also means the buffer needs to be
>> allocated separately, not embedded in BufFile struct. (I'm a bit
>> surprised it works without having to re-read the buffer after freeing
>> it, but that's probably thanks to how hashjoin uses the files).
>
> I started looking at this. Even though you do explain what it does
> above, I still found it a bit hard to follow. Could you walk through
> an example (like the one you gave in SQL) but explaining what happens
> in the implementation? Basically what you have in 2 and 3 above but
> with a specific example.
>

OK, I'll try ... see the end of this message.

> This is my understanding of what this does:
> if we are at the max number of batches when building the hashtable and
> we run out of space and need to double nbatches, we
> 1. dump the data from the current batch that is in the hashtable into a file
> 2. close and flush are the currently open buffiles, double the number
> of batches, and then only open files for the batches we need to store
> tuples from the batch we were trying to put in the hashtable when we
> hit the limit (now in a temp file)
>

Roughly, but the second step needs to happen only after we finish the
first pass over the inner relation. I'll try to explain this as part of
the example.

> I also don't understand why ExecHashJoinRepartitionBatches() is needed
> -- I think it has something to do with needing a certain number of
> buffers open while processing batch 0, but what does this have to do
> with the outer side of the join?
>

No, this is about building batches on the outer side. We've built the
hash table, and we may have ended with a very high nbatch. We can't
build all of them right away (would need too many buffiles), so we do
that in multiple phases, to not cross the limit.

> Another random question: why doesn't ExecHashHandleTooManyBatches()
> free the outer batch files?
>

Because it was tailored for the example when all batch splits happen for
batch 0, before we even start processing the outer side. In practice it
probably should free the files.

Let's do the example - as I mentioned, I only tried doing this for the
case where all the batch increases happen for batch 0, before we start
building the outer batches. I'm 99% sure the patch will need to modify a
couple more places to handle batch increases in later stages.

Assume we don't want to use more than 128 batches, but that we're
running a query that needs 256 batches. The patch will do this:

1) ExecHashTableCreate will set nbatch_maximum=128 as the limit for the
current pass over inner relation, and it'll cap the other nbatch fields
accordingly. If we already know we'll need more batches, we set
tooManyBatches=true to remember this.

But let's we start with nbatch=64, nbatch_maximum=128 (and thus also
with tooManyBatches=false).

2) We start loading data into the hash table, until exceed the memory
limit for the first time. We double the number to 128, move some of the
data from the hash table to the new batch, and continue.

3) We hit the memory limit again, but this time we've hit

(nbatch == nbatch_maximum)

so we can't double the number of batches. But we also can't continue
adding data to the in-memory hash table, so we set tooManyBatches=true
and we start spilling even the current batch to a file.

4) We finish the first pass over the inner relation with

nbatch = 128
nbatch_maximum = 128
tooManyBatches = true

so we need to do something. We run ExecHashHandleTooManyBatches() starts
increasing the nbatches until the current batch fits into work_mem. We
have nbatch=128, and the query needs nbatch=256, so we only do one loop.

Note: Right now it simply doubles the number of batches in each loop.
But it could be faster and do up to 128 in one step.

128 -> 16k -> 1M

The later batches will already do all the increases in a single step,
that needs an improvement too.

4) After ExecHashHandleTooManyBatches completed, we have the inner side
of the batch mostly "done". We have nbatch=256.

5) We start building batches on the outer side, but we also don't want
to build all the batches at once - we want to build 128 and only then go
to 256 (or further). This is what ExecHashJoinRepartitionBatches does.

If we have too many batches for one pass, we build 128 batches in the
first pass. And then we just read the batch files, doing further splits.
Right now this just does a single pass and thus splits the relation into
128 batches, and then just continues as before. That's enough for 256
batches, because 256 is a single step past 128.

But it really should be recursive / do multiple passes, to handle more
cases with more than 16k batches (although with higher limit it would be
less of an issue).

5) It does free the file buffers in various places. Chances are some of
those places are unnecessary, and it should be done in some more places.

As I said, I don't claim this to handle all cases, especially with
splits in later batches.

Does this make it clearer?

regards

--
Tomas Vondra

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Michail Nikolaev 2025-01-13 17:31:04 Re: Conflict Detection and Resolution
Previous Message Alena Rybakina 2025-01-13 17:19:42 Re: Exists pull-up application with JoinExpr