From: | Tomas Vondra <tomas(at)vondra(dot)me> |
---|---|
To: | Melanie Plageman <melanieplageman(at)gmail(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Adjusting hash join memory limit to handle batch explosion |
Date: | 2025-01-11 23:49:08 |
Message-ID: | 11efccd9-657c-4132-b4a7-28830ea8f121@vondra.me |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 1/11/25 00:09, Melanie Plageman wrote:
> On Fri, Jan 10, 2025 at 11:18 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>
>> On 1/10/25 15:54, Melanie Plageman wrote:
>>> On Thu, Jan 9, 2025 at 6:59 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>> I think this is because we get the batch based on
>>>
>>> *batchno = pg_rotate_right32(hashvalue, hashtable->log2_nbuckets) &
>>> (nbatch - 1);
>>>
>>> And tuples can only spill forward. I think Robert's example is if we
>>> plan for 64 batches and eventually increase to 256 batches, a tuple
>>> assigned to batch 1 could go to 65, 129, or 193 but no other batch --
>>> meaning we would only need 3 files open when processing batch 1.
>>
>> Yes, I think that's why we only need 3 more files when splitting a
>> batch. The way I explain it is that going from 64 -> 256 adds 2 more
>> bits to the "batchno" part of the batch, and one of the patterns means
>> "current batch", so 3 new files.
>>
>> This does remind me the "one spill file per slice" patch in [1],
>> although it approaches it from a different angle. My patch defined the
>> "slice" as batches we can keep in work_mem, while Robert proposed to
>> decide how many batches we can open (first level of batching), and then
>> maybe do that recursively if needed. That seems like a fundamentally
>> more sound approach (indeed, my patch can create too many slices).
>>
>> [1]
>> https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
>>
>>> But I think we would need to do more explicit file flushing and
>>> closing and opening, right? Which maybe doesn't matter when compared
>>> to the overhead of so many more buffers.
>> Would it be all that much flushing and closing? Yes, we'd need to flush
>> and release the buffers (which I don't think BufFiles can do right now,
>> but let's ignore that for now). But I'd hope the batches are fairly
>> large (because that's why expect to generate them), so one more write
>> should not make a lot of difference on top of the actual bach split.
>> Chances are it's far cheaper than the extra memory pressure due to
>> keeping all batches in memory ...
>>
>> I wonder if it might be a problem that those future batches are "far
>> apart". I mean, we're splitting batch 1, and the tuple may go to batches
>> +64, +128 and +192 batches ahead. If we allow larger "jumps" (e.g. from
>> 256 to 64k batches), it'd be even more visible.
>
> I don't follow. Why would it be a problem if tuples have to go to
> batches that are far away in number?
>
I think you're right it's not a problem, I was just thinking aloud (or
whatever you do in an e-mail).
>>>> Of course, this is a more complex change than the "balancing" patch. But
>>>> maybe not that much, not sure. For me the main disadvantage is it
>>>> doesn't really help with the batch explosion for skewed data sets (or
>>>> data with many hash collisions). It can easily happen we blindly
>>>> increase nbatch until we use all the bits, and then break the work_mem
>>>> limit anyway.
>>>>
>>>> But maybe there's a way to address that - the growthEnabled=false safety
>>>> is an unreliable solution, because it requires the whole batch to fall
>>>> to either of the new batches. A single tuple breaks that.
>>>>
>>>> What if we instead compared the two new batches, and instead looked at
>>>> how far the split is from 1/2? And if it's very far from 1/2, we'd
>>>> either increase work_mem (a bit like the balancing), or disable nbatch
>>>> increases (maybe just temporarily).
>>>
>>> Meaning like have some threshold for the number of tuples over the
>>> limit we are? Right now, we decide to increase batches when we
>>> encounter that one tuple that puts us over the limit. So, I could see
>>> it making sense to decide with more foresight. Or we could even keep
>>> track of the amount over the limit we are and increase the number of
>>> batches once we hit that threshold.
>>>
>>
>> Not sure I understand. I meant that we disable nbatch growth like this:
>>
>> if (nfreed == 0 || nfreed == ninmemory)
>> {
>> hashtable->growEnabled = false;
>> }
>
> Ah, right. I was thinking of the wrong thing.
>
>> which means that it only takes a single tuple that makes it to the other
>> batch to keep growing. But if 99.9999% tuples went to one of the
>> batches, increasing nbatch seems pretty futile.
>
> Right. Yes, that is unfortunate. You could do a percentage threshold.
> Or if we knew how big the biggest batch is, we could decide whether or
> not to disable growth based on the size the hashtable would be for
> that batch vs the overhead of another doubling of nbatches.
>
I was thinking it might be possible to express this as a formula similar
to the "balancing". I mean, something that says "just double as you
wish" when the current doubling split the batch 50:50, but delays the
next doubling if the batch gets split 99:1 (with some continuous
transition between those two extremes).
Or maybe this could also drive increasing the memory limit. Yes, there's
a chance that the next doubling will split it more evenly, but I think
it's much more likely there really are hash collisions of some sort.
>> But it goes in the opposite direction too. Imagine a uniform data set
>> with plenty of distinct values, but correlated / sorted, and each value
>> having more rows that can fit into a single batch. We'll immediately
>> disable growth, which is ... not great.
>>
>> These are somewhat separate / independent issues, but I thin having a
>> concept of "retrying the nbatch growth after a while" would help.
>
> Yes, I think retrying nbatch growth later makes sense in this case. Or
> when doubling nbatches wouldn't help split one rogue batch but would
> help other big batches.
>
Exactly. Giving up the growth entirely seems a bit premature. I don't
think there's a principled formula to determine when to retry, but it
might be enough to try after the hash table doubles in size. That's
pretty much the "let's increase work_mem a bit" I mentioned above.
>>> You could do something like this _and_ do something like close the
>>> files that can't be the target of tuples from the current batch --
>>> which would allow you to tolerate many more batch increases before
>>> doubling the hashtable size is worth it. But it seems like the
>>> algorithm to adapt the hashtable size based on the optimal tradeoff
>>> between hashtable size and number of batches could be done first and
>>> the patch to close files could be done later.
>>
>> Right. I don't think Robert's idea is a a complete answer, because it
>> does not consider the tradeoff that maybe increasing work_mem would be
>> better. OTOH maybe that's not something the hashjoin should worry about.
>> The goal is not to optimize the work_mem value, but make sure we don't
>> use significantly more memory ...
>
> Well it's also not a complete solution because it doesn't solve the
> hash collision/batch explosion case.
>
I think it does, in a way. It doesn't prevent the batch explosion, of
course, it still ends with millions of batches. But it's not keeping all
the BufFiles open at the same time, so it does not use the insane
amounts of memory. And it's slower than the balancing, of course.
>> If hashjoin starts to optimize this, why shouldn't the other places
>> using work_mem do something similar?
>
> Yes, I suppose other spilling operators (like hashagg) that use
> buffered files may consider doing this. But I don't think that is a
> reason not to use this particular strategy to "fix" this hash join
> batch explosion issue.
>
> You could make the argument that because it is the buffers and not the
> actual number of batches that is the problem, that we should fix it by
> closing the files that aren't being used while processing a batch.
> But I really like how small and isolated your sizing balance patch is.
> And I actually think that the fact that it could be used to optimize
> this tradeoff (work_mem/file buffers) in other places is good. Anyway,
> my point was just that we could do both -- likely in any order.
>
Right. The way I'm looking at this is the balancing patch is a strict
improvement over the current state. Robert's proposal is more principled
in that it actually tries to enforce the promised memory limit.
regards
--
Tomas Vondra
From | Date | Subject | |
---|---|---|---|
Next Message | Andres Freund | 2025-01-12 00:03:43 | Re: llvm dependency and space concerns |
Previous Message | Christoph Moench-Tegeder | 2025-01-11 23:33:35 | Re: llvm dependency and space concerns |