From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Adjusting hash join memory limit to handle batch explosion
Date: 2025-01-10 16:18:12
Message-ID: 05a2bc17-2242-4a60-8580-6b6eba6ccf9f@vondra.me
Lists: pgsql-hackers
On 1/10/25 15:54, Melanie Plageman wrote:
> On Thu, Jan 9, 2025 at 6:59 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>
>>
>>
>> On 1/9/25 21:42, Melanie Plageman wrote:
>>>
>>> I was excited about your raw file experiment. As Robert and you point
>>> out -- we may need a file per batch, but for most of the hash join's
>>> execution we don't need to keep buffers for each batch around.
>>> However, given that the experiment didn't yield great results and we
>>> haven't come up with an alternative solution with sufficiently few
>>> flaws, I'm still in favor of 1.
>>>
>>
>> But I think those were two distinct proposals.
>>
>> My experiment with raw files keeps adding batches just like the current
>> code (so it might quickly explode to 1M batches) and then keeps feeding
>> data to 1M files at the same time. This doesn't work; the buffering
>> clearly helps a lot, and it'd affect all hashjoins, even those with
>> fewer batches.
>
> I see.
>
>> Robert's idea kept using buffered files, but limited how many we can
>> fill at any phase. Say we'd use a limit of 1024 batches, but we actually
>> need 1M batches. Then we'd do the build in two phases - we'd generate
>> 1024 batches, and then we'd split each of those batches into 1024
>> smaller batches. The trick (as I understand it) is those batches can't
>> overlap, so we'd never have more than 1024 batches open at a time,
>> which greatly limits the memory consumption. We could even use a lower
>> limit, derived from work_mem or something like that.
>
> I think this is because we get the batch based on
>
> *batchno = pg_rotate_right32(hashvalue, hashtable->log2_nbuckets) &
> (nbatch - 1);
>
> And tuples can only spill forward. I think Robert's example is if we
> plan for 64 batches and eventually increase to 256 batches, a tuple
> assigned to batch 1 could go to 65, 129, or 193 but no other batch --
> meaning we would only need 3 files open when processing batch 1.
Yes, I think that's why we only need 3 more files when splitting a
batch. The way I explain it is that going from 64 -> 256 adds 2 more
bits to the "batchno" part of the hash value, and one of the 4 patterns means
"current batch", so 3 new files.
This does remind me of the "one spill file per slice" patch in [1],
although it approaches the problem from a different angle. My patch
defined the "slice" as the set of batches whose buffers fit into
work_mem, while Robert proposed to
decide how many batches we can open (first level of batching), and then
maybe do that recursively if needed. That seems like a fundamentally
more sound approach (indeed, my patch can create too many slices).
[1]
https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
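To make the bit math above a bit more concrete, here's a tiny standalone
sketch (not PostgreSQL code - the rotate helper just mimics what
pg_rotate_right32 does, and the bucket/batch counts are made up):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

/* same idea as pg_rotate_right32() */
static uint32_t
rotate_right32(uint32_t word, int n)
{
    return (word >> n) | (word << (32 - n));
}

int
main(void)
{
    int         log2_nbuckets = 10;     /* pretend we have 1024 buckets */
    uint32_t    old_nbatch = 64;
    uint32_t    new_nbatch = 256;

    srandom(54321);

    for (int i = 0; i < 1000000; i++)
    {
        uint32_t    hashvalue = (uint32_t) random() * 2654435761u;
        uint32_t    rotated = rotate_right32(hashvalue, log2_nbuckets);
        uint32_t    oldbatch = rotated & (old_nbatch - 1);
        uint32_t    newbatch = rotated & (new_nbatch - 1);

        /* the old batchno is always the low 6 bits of the new one */
        if ((newbatch & (old_nbatch - 1)) != oldbatch)
        {
            printf("unexpected: old %u new %u\n", oldbatch, newbatch);
            return 1;
        }
    }

    printf("every tuple from old batch B ends in B, B+64, B+128 or B+192\n");
    return 0;
}

So when splitting batch B after going 64 -> 256, we only ever write to
B+64, B+128 and B+192, and every other batch file can stay closed.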
> But I think we would need to do more explicit file flushing and
> closing and opening, right? Which maybe doesn't matter when compared
> to the overhead of so many more buffers.
Would it be all that much flushing and closing? Yes, we'd need to flush
and release the buffers (which I don't think BufFiles can do right now,
but let's ignore that for now). But I'd hope the batches are fairly
large (after all, that's why we expect to generate them), so one more
write should not make much difference on top of the actual batch split.
Chances are it's far cheaper than the extra memory pressure from
keeping buffers for all the batches in memory ...
I wonder if it might be a problem that those future batches are "far
apart". I mean, we're splitting batch 1, and the tuple may go to a batch
+64, +128 or +192 ahead. If we allow larger "jumps" (e.g. from 256 to
64k batches, i.e. 8 more bits and 255 possible destinations spread over
the whole range), it'd be even more visible.
For the case of batch explosion I don't think it matters too much - it
will still explode into an absurd number of batches; that doesn't
change. But that's fine, the point is to not cause OOM. Improving this
case would require increasing the work_mem limit (either directly or by
stopping the growth).
For regular cases I think the idea is that the limit would be high
enough that we don't hit this very often. I mean, how many real-world
queries use more than ~1024 batches? I don't think that's very common.
>> Of course, this is a more complex change than the "balancing" patch. But
>> maybe not that much, not sure. For me the main disadvantage is it
>> doesn't really help with the batch explosion for skewed data sets (or
>> data with many hash collisions). It can easily happen we blindly
>> increase nbatch until we use all the bits, and then break the work_mem
>> limit anyway.
>>
>> But maybe there's a way to address that - the growEnabled=false safety
>> is an unreliable solution, because it requires the whole batch to fall
>> to either of the new batches. A single tuple breaks that.
>>
>> What if we instead compared the two new batches, and looked at how far
>> the split is from 1/2? And if it's very far from 1/2, we'd
>> either increase work_mem (a bit like the balancing), or disable nbatch
>> increases (maybe just temporarily).
>
> Meaning like have some threshold for the number of tuples over the
> limit we are? Right now, we decide to increase batches when we
> encounter that one tuple that puts us over the limit. So, I could see
> it making sense to decide with more foresight. Or we could even keep
> track of the amount over the limit we are and increase the number of
> batches once we hit that threshold.
>
Not sure I understand. I meant that we disable nbatch growth like this:
if (nfreed == 0 || nfreed == ninmemory)
{
    hashtable->growEnabled = false;
}
which means it only takes a single tuple making it to the other batch
to keep growth enabled. But if 99.9999% of the tuples went to one of
the batches, increasing nbatch seems pretty futile.
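What I had in mind is something along these lines (just a sketch - the
helper doesn't exist in nodeHash.c and the 5% threshold is entirely
arbitrary):

#include <stdbool.h>

/*
 * Hypothetical helper: given how many tuples moved to the new batch
 * (nfreed) out of those examined (ninmemory), decide whether doubling
 * nbatch actually freed a meaningful amount of memory. The current code
 * only gives up on the exact all-or-nothing case (nfreed == 0 or
 * nfreed == ninmemory); here any very lopsided split counts as a failure.
 */
static bool
split_was_effective(long nfreed, long ninmemory)
{
    double      frac;

    if (ninmemory == 0)
        return false;           /* nothing to split */

    frac = (double) nfreed / (double) ninmemory;

    /* arbitrary 5% threshold - anything more lopsided is "futile" */
    return (frac >= 0.05 && frac <= 0.95);
}

The caller would then either bump the in-memory limit (a bit like the
balancing patch does), or disable the growth - possibly only temporarily.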
But it goes in the opposite direction too. Imagine a uniform data set
with plenty of distinct values, but correlated / sorted, and each value
having more rows than can fit into a single batch. We'll immediately
disable growth, which is ... not great.
These are somewhat separate / independent issues, but I think having a
concept of "retrying the nbatch growth after a while" would help.
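To illustrate what I mean by "retrying" - something like this, where
both fields and the 2x threshold are made up (HashJoinTable has no such
members):

#include <stdbool.h>

typedef struct GrowthState
{
    bool        growEnabled;        /* is nbatch doubling allowed right now? */
    long        ntuplesAtDisable;   /* batch size when growth was disabled */
} GrowthState;

/*
 * Hypothetical check, called as tuples keep arriving in the batch: if
 * the batch doubled in size since we gave up on splitting it, the data
 * distribution may have changed (e.g. we moved past a long run of
 * duplicate keys in correlated input), so give nbatch growth another
 * chance.
 */
static void
maybe_reenable_growth(GrowthState *state, long ntuples_in_batch)
{
    if (!state->growEnabled &&
        ntuples_in_batch > 2 * state->ntuplesAtDisable)
        state->growEnabled = true;
}

That would handle the correlated / sorted case above without giving up
on nbatch growth forever.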
> This kind of seems like it would circle back to your algorithm for
> deciding on the right tradeoff between hashtable size and number of
> batches, though.
Yes, it's about the same general idea, just expressed in a slightly
different way (the part about growing work_mem, that is).
>
> You could do something like this _and_ do something like close the
> files that can't be the target of tuples from the current batch --
> which would allow you to tolerate many more batch increases before
> doubling the hashtable size is worth it. But it seems like the
> algorithm to adapt the hashtable size based on the optimal tradeoff
> between hashtable size and number of batches could be done first and
> the patch to close files could be done later.
>
Right. I don't think Robert's idea is a complete answer, because it
does not consider the tradeoff that maybe increasing work_mem would be
better. OTOH maybe that's not something the hashjoin should worry about.
The goal is not to optimize the work_mem value, but to make sure we
don't use significantly more memory ...
If hashjoin starts to optimize this, why shouldn't the other places
using work_mem do something similar?
regards
--
Tomas Vondra