Re: Adjusting hash join memory limit to handle batch explosion

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Adjusting hash join memory limit to handle batch explosion
Date: 2025-01-13 16:32:04
Message-ID: CAAKRu_YiLv24QKKJvUtDoO+=tw9Z5mt2i1g5=0992nmtwRekBQ@mail.gmail.com
Lists: pgsql-hackers

On Sat, Jan 11, 2025 at 7:42 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> I had a quiet evening yesterday, so I decided to take a stab at this and
> see how hard it would be, and how bad the impact would be. Attached is
> an experimental patch, doing the *bare* minimum for a simple query:
>
> 1) It defines a limit of 128 batches (a bit low, but also 1MB). In
> practice we'd use something like 256 - 1024, probably. Doesn't matter.
>
> 2) Ensures the initial pass over data in MultiExecPrivateHash does not
> use more than 128 batches, and switches to "tooManyBatches=true" if that
> happens (dumping the batch to a file via ExecHashDumpBatchToFile, even if
> it's batchno=0). Later it calls ExecHashHandleTooManyBatches() to
> increase nbatch further.
>
> 3) Does something similar for the outer relation - if there are too many
> batches, we do ExecHashJoinRepartitionBatches(), which first partitions
> the data into 128 batches. The WIP only does a single pass, though; it
> should probably be recursive or something.
>
> 4) Extends the BufFile API with BufFileHasBuffer/BufFileFreeBuffer so
> that the code can free the buffers. It also means the buffer needs to be
> allocated separately, not embedded in the BufFile struct. (I'm a bit
> surprised it works without having to re-read the buffer after freeing
> it, but that's probably thanks to how hashjoin uses the files.)

I started looking at this. Even though you do explain what it does
above, I still found it a bit hard to follow. Could you walk through
an example (like the one you gave in SQL), explaining what happens
in the implementation? Basically, what you describe in 2) and 3) above,
but with a specific example.

This is my understanding of what this does:
if we are at the maximum number of batches when building the hashtable
and we run out of space and would need to double nbatch, we
1. dump the data from the current batch that is in the hashtable into a file
2. flush and close all the currently open BufFiles, double the number
of batches, and then open files only for the batches we need to store
tuples from the batch we were trying to put in the hashtable when we
hit the limit (now in a temp file)
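
To check whether I have the control flow right, here is a toy sketch of
my mental model (just an illustration of the sequence, not the patch
code -- the capacity, tuple counts, and helper names are made up):

#include <stdio.h>
#include <stdlib.h>

#define MAX_BATCHES         128     /* the patch's limit (128 * 8kB buffers ~ 1MB) */
#define HASHTABLE_CAPACITY  4       /* pretend work_mem: tuples that fit in memory */
#define NTUPLES             64

static int nbatch = 1;

/* which batch a hash value falls into, roughly ExecHashGetBucketAndBatch */
static int
batchno(unsigned int hash)
{
    return (int) (hash % (unsigned int) nbatch);
}

int
main(void)
{
    int     in_table = 0;           /* tuples currently in the "hashtable" */
    int     too_many_batches = 0;   /* the patch's tooManyBatches flag */

    for (int i = 0; i < NTUPLES; i++)
    {
        unsigned int hash = (unsigned int) rand();

        if (batchno(hash) != 0)
            continue;               /* later batch: tuple goes to its batch file */

        if (too_many_batches)
            continue;               /* batch 0 already dumped: tuple goes to its file too */

        if (in_table < HASHTABLE_CAPACITY)
        {
            in_table++;             /* fits into the in-memory hashtable */
            continue;
        }

        if (nbatch * 2 <= MAX_BATCHES)
        {
            nbatch *= 2;            /* normal case: double nbatch ... */
            in_table /= 2;          /* ... and move ~half the tuples to batch files */
        }
        else
        {
            /*
             * At the limit: dump the in-memory batch to a temp file
             * (ExecHashDumpBatchToFile, even for batchno 0), flush and close
             * the open batch files to release their buffers, and let
             * ExecHashHandleTooManyBatches() increase nbatch later, opening
             * only the files it actually needs.
             */
            too_many_batches = 1;
            in_table = 0;
        }
    }

    printf("nbatch = %d, tooManyBatches = %d\n", nbatch, too_many_batches);
    return 0;
}

Again, this is just how I'm reading it, so please correct me where I'm off.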

I also don't understand why ExecHashJoinRepartitionBatches() is needed
-- I think it has something to do with needing a certain number of
buffers open while processing batch 0, but what does this have to do
with the outer side of the join?
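
(For scale, assuming the usual 8kB BufFile buffer: 128 batches is about
1MB of buffers per side of the join, whereas a batch explosion to, say,
1M batches would be about 8GB per side -- so I can see that the outer
batch files need the same kind of cap, I just don't follow how this
particular function fits in.)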

Another random question: why doesn't ExecHashHandleTooManyBatches()
free the outer batch files?
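
Relatedly, my mental model of the BufFileHasBuffer/BufFileFreeBuffer
change (4 above) is something like the toy below -- a separately
allocated buffer that can be released, and later reallocated, while the
file stays open. This is only an illustration of the idea, not the real
BufFile or the patch:

#include <stdlib.h>

typedef struct ToyBufFile
{
    /* ... file descriptors, offsets, etc. omitted ... */
    char       *buffer;             /* allocated on demand, not embedded */
} ToyBufFile;

int
ToyBufFileHasBuffer(ToyBufFile *file)
{
    return file->buffer != NULL;
}

void
ToyBufFileFreeBuffer(ToyBufFile *file)
{
    /* caller is expected to have flushed any dirty data first */
    free(file->buffer);
    file->buffer = NULL;
}

int
main(void)
{
    ToyBufFile  file = {NULL};

    file.buffer = malloc(8192);         /* pretend BLCKSZ-sized buffer */
    if (ToyBufFileHasBuffer(&file))
        ToyBufFileFreeBuffer(&file);    /* e.g. once the batch is flushed */
    return 0;
}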

- Melanie
