Re: Adjusting hash join memory limit to handle batch explosion

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Adjusting hash join memory limit to handle batch explosion
Date: 2025-01-10 23:09:24
Message-ID: CAAKRu_bSwfEJy9wzn5xwDyAmZh8-cRiBMijEyyCa38bka=wjDw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jan 10, 2025 at 11:18 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> On 1/10/25 15:54, Melanie Plageman wrote:
> > On Thu, Jan 9, 2025 at 6:59 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> > I think this is because we get the batch based on
> >
> > *batchno = pg_rotate_right32(hashvalue, hashtable->log2_nbuckets) &
> > (nbatch - 1);
> >
> > And tuples can only spill forward. I think Robert's example is if we
> > plan for 64 batches and eventually increase to 256 batches, a tuple
> > assigned to batch 1 could go to 65, 129, or 193 but no other batch --
> > meaning we would only need 3 files open when processing batch 1.
>
> Yes, I think that's why we only need 3 more files when splitting a
> batch. The way I explain it is that going from 64 -> 256 adds 2 more
> bits to the "batchno" part of the batch, and one of the patterns means
> "current batch", so 3 new files.
>
> This does remind me the "one spill file per slice" patch in [1],
> although it approaches it from a different angle. My patch defined the
> "slice" as batches we can keep in work_mem, while Robert proposed to
> decide how many batches we can open (first level of batching), and then
> maybe do that recursively if needed. That seems like a fundamentally
> more sound approach (indeed, my patch can create too many slices).
>
> [1]
> https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
>
> > But I think we would need to do more explicit file flushing and
> > closing and opening, right? Which maybe doesn't matter when compared
> > to the overhead of so many more buffers.
> Would it be all that much flushing and closing? Yes, we'd need to flush
> and release the buffers (which I don't think BufFiles can do right now,
> but let's ignore that for now). But I'd hope the batches are fairly
> large (because that's why expect to generate them), so one more write
> should not make a lot of difference on top of the actual bach split.
> Chances are it's far cheaper than the extra memory pressure due to
> keeping all batches in memory ...
>
> I wonder if it might be a problem that those future batches are "far
> apart". I mean, we're splitting batch 1, and the tuple may go to batches
> +64, +128 and +192 batches ahead. If we allow larger "jumps" (e.g. from
> 256 to 64k batches), it'd be even more visible.

I don't follow. Why would it be a problem if tuples have to go to
batches that are far away in number?

> >> Of course, this is a more complex change than the "balancing" patch. But
> >> maybe not that much, not sure. For me the main disadvantage is it
> >> doesn't really help with the batch explosion for skewed data sets (or
> >> data with many hash collisions). It can easily happen we blindly
> >> increase nbatch until we use all the bits, and then break the work_mem
> >> limit anyway.
> >>
> >> But maybe there's a way to address that - the growthEnabled=false safety
> >> is an unreliable solution, because it requires the whole batch to fall
> >> to either of the new batches. A single tuple breaks that.
> >>
> >> What if we instead compared the two new batches, and instead looked at
> >> how far the split is from 1/2? And if it's very far from 1/2, we'd
> >> either increase work_mem (a bit like the balancing), or disable nbatch
> >> increases (maybe just temporarily).
> >
> > Meaning like have some threshold for the number of tuples over the
> > limit we are? Right now, we decide to increase batches when we
> > encounter that one tuple that puts us over the limit. So, I could see
> > it making sense to decide with more foresight. Or we could even keep
> > track of the amount over the limit we are and increase the number of
> > batches once we hit that threshold.
> >
>
> Not sure I understand. I meant that we disable nbatch growth like this:
>
> if (nfreed == 0 || nfreed == ninmemory)
> {
> hashtable->growEnabled = false;
> }

Ah, right. I was thinking of the wrong thing.

> which means that it only takes a single tuple that makes it to the other
> batch to keep growing. But if 99.9999% tuples went to one of the
> batches, increasing nbatch seems pretty futile.

Right. Yes, that is unfortunate. You could do a percentage threshold.
Or if we knew how big the biggest batch is, we could decide whether or
not to disable growth based on the size the hashtable would be for
that batch vs the overhead of another doubling of nbatches.

> But it goes in the opposite direction too. Imagine a uniform data set
> with plenty of distinct values, but correlated / sorted, and each value
> having more rows that can fit into a single batch. We'll immediately
> disable growth, which is ... not great.
>
> These are somewhat separate / independent issues, but I thin having a
> concept of "retrying the nbatch growth after a while" would help.

Yes, I think retrying nbatch growth later makes sense in this case. Or
when doubling nbatches wouldn't help split one rogue batch but would
help other big batches.

> > You could do something like this _and_ do something like close the
> > files that can't be the target of tuples from the current batch --
> > which would allow you to tolerate many more batch increases before
> > doubling the hashtable size is worth it. But it seems like the
> > algorithm to adapt the hashtable size based on the optimal tradeoff
> > between hashtable size and number of batches could be done first and
> > the patch to close files could be done later.
>
> Right. I don't think Robert's idea is a a complete answer, because it
> does not consider the tradeoff that maybe increasing work_mem would be
> better. OTOH maybe that's not something the hashjoin should worry about.
> The goal is not to optimize the work_mem value, but make sure we don't
> use significantly more memory ...

Well it's also not a complete solution because it doesn't solve the
hash collision/batch explosion case.

> If hashjoin starts to optimize this, why shouldn't the other places
> using work_mem do something similar?

Yes, I suppose other spilling operators (like hashagg) that use
buffered files may consider doing this. But I don't think that is a
reason not to use this particular strategy to "fix" this hash join
batch explosion issue.

You could make the argument that because it is the buffers and not the
actual number of batches that is the problem, that we should fix it by
closing the files that aren't being used while processing a batch.
But I really like how small and isolated your sizing balance patch is.
And I actually think that the fact that it could be used to optimize
this tradeoff (work_mem/file buffers) in other places is good. Anyway,
my point was just that we could do both -- likely in any order.

- Melanie

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Abhishek Chanda 2025-01-10 23:16:44 Re: Adding support for SSLKEYLOGFILE in the frontend
Previous Message Daniel Gustafsson 2025-01-10 22:59:51 Re: Moving the vacuum GUCs' docs out of the Client Connection Defaults section