Re: Adjusting hash join memory limit to handle batch explosion

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Adjusting hash join memory limit to handle batch explosion
Date: 2025-01-09 23:59:29
Message-ID: 5a440920-4ca4-4bd0-8b42-a5aa7af4fb70@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 1/9/25 21:42, Melanie Plageman wrote:
> On Tue, Dec 31, 2024 at 6:07 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>
>> This means that ultimately it's either (1) or (3), and the more I've
>> been looking into this the more I prefer (1), for a couple reasons:
>>
>> * It's much simpler (it doesn't really change anything on the basic
>> behavior, doesn't introduce any new files or anything like that.
>>
>> * There doesn't seem to be major difference in total memory consumption
>> between the two approaches. Both allow allocating more memory.
>>
>> * It actually helps with the "indivisible batch" case - it relaxes the
>> limit, so there's a chance the batch eventually fits and we stop adding
>> more and more batches. With spill files that's not the case - we still
>> keep the original limit, and we end up with the batch explosion (but
>> then we handle it much more efficiently).
>
> Okay, I've read all the patches proposed in this mail and most of the
> downthread ideas, and I want to cast my vote for option 1.
> I find the design of 3 too complicated for what it buys us.
>
> The slices make it harder to understand how the system works. The
> current 1-1 relationship in master between batches and spill files is
> easier to reason about. With the slices, I'm imagining trying to
> understand why we, for example, have to move tuples from batch 4 just
> because batch 5 got too big for the hashtable.
>
> I think something like this might be worth it if it solved the problem
> entirely, but if it is just a somewhat better coping mechanism, I
> don't think it is worth it.
>
> I was excited about your raw file experiment. As Robert and you point
> out -- we may need a file per batch, but for most of the hash join's
> execution we don't need to keep buffers for each batch around.
> However, given that the experiment didn't yield great results and we
> haven't come up with an alternative solution with sufficiently few
> flaws, I'm still in favor of 1.
>

But I think those were two distinct proposals.

My experiment with raw files keeps adding batches just like the current
code (so it might quickly explode to 1M batches) and then keep feeding
data to 1M files at the same time. This doesn't work, the buffering
clearly helps a lot, and it'd affect all hashjoins, even those with
fewer batches.

Robert's idea kept using buffered files, but limited how many we can
fill at any phase. Say we'd use a limit of 1024 batches, but we actually
need 1M batches. Then we'd do the build in two phases - we'd generate
1024 batches, and then we'd split each of those batches into 1024
smaller batches. The trick (as I understand it) is those batches can't
overlap, so we'd not need more than 1024 batches, which greatly limits
the memory consumption. We could even use a lower limit, derived from
work_mem or something like that.

Of course, this is a more complex change than the "balancing" patch. But
maybe not that much, not sure. For me the main disadvantage is it
doesn't really help with the batch explosion for skewed data sets (or
data with many hash collisions). It can easily happen we blindly
increase nbatch until we use all the bits, and then break the work_mem
limit anyway.

But maybe there's a way to address that - the growthEnabled=false safety
is an unreliable solution, because it requires the whole batch to fall
to either of the new batches. A single tuple breaks that.

What if we instead compared the two new batches, and instead looked at
how far the split is from 1/2? And if it's very far from 1/2, we'd
either increase work_mem (a bit like the balancing), or disable nbatch
increases (maybe just temporarily).

> The part of 1 I struggled to understand is the math in
> ExecHashExceededMemoryLimit(). I think the other email you sent with
> the charts and diagonals is about choosing the optimal hashtable size
> and number of batches (when to stop growing the number of batches and
> just increase the size of the hashtable). So, I'll dive into that.
>

That math is a bit unclear even to me, that patch was written before I
took the time to work out the formulas and visualizations. It works and
does about the right decisions, but with less rigor. So maybe don't
waste too much time trying to understand it.

>> One thing I'm not sure about yet is whether this needs to tweak the
>> hashjoin costing to also consider the files when deciding how many
>> batches to use. Maybe it should?
>
> I think it definitely should. The ExecChooseHashTableSize()
> calculations look similar to what we use to calculate spaceAllowed, so
> it makes sense that we would consider buffile sizes if we are counting
> those in spaceUsed now.
>

Yeah. I think the flaw is we may not actually know the number of batches
during planning. In the batch explosion example we start with very few
batches, that only happens during execution.

regards

--
Tomas Vondra

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2025-01-10 00:13:08 Re: IANA timezone abbreviations versus timezone_abbreviations
Previous Message Nathan Bossart 2025-01-09 23:27:03 Re: Proposal: add new API to stringinfo