Re: Adjusting hash join memory limit to handle batch explosion

From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Adjusting hash join memory limit to handle batch explosion
Date: 2025-01-10 14:54:54
Message-ID: CAAKRu_aBW=NKGPK=+d_Xc0uk2ci47x7uej3WGUjgQSRLdnFyqw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Jan 9, 2025 at 6:59 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
>
>
> On 1/9/25 21:42, Melanie Plageman wrote:
> >
> > I was excited about your raw file experiment. As Robert and you point
> > out -- we may need a file per batch, but for most of the hash join's
> > execution we don't need to keep buffers for each batch around.
> > However, given that the experiment didn't yield great results and we
> > haven't come up with an alternative solution with sufficiently few
> > flaws, I'm still in favor of 1.
> >
>
> But I think those were two distinct proposals.
>
> My experiment with raw files keeps adding batches just like the current
> code (so it might quickly explode to 1M batches) and then keeps feeding
> data to 1M files at the same time. This doesn't work; the buffering
> clearly helps a lot, and it would affect all hash joins, even those with
> fewer batches.

I see.

> Robert's idea kept using buffered files, but limited how many we can
> fill at any phase. Say we'd use a limit of 1024 batches, but we actually
> need 1M batches. Then we'd do the build in two phases - we'd generate
> 1024 batches, and then we'd split each of those batches into 1024
> smaller batches. The trick (as I understand it) is that those batches
> can't overlap, so we'd never need more than 1024 batch files open at
> once, which greatly limits the memory consumption. We could even use a
> lower limit, derived from work_mem or something like that.
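
If I'm reading that right, each pass would consume a disjoint slice of
the hash bits -- something like the sketch below (purely hypothetical,
not from any patch; FANOUT and partition_for_pass are made-up names):

#include <stdint.h>

#define FANOUT 1024             /* cap on batch files open in one pass */

/*
 * Hypothetical per-pass partitioning: pass 0 uses hash bits 0-9, pass 1
 * uses bits 10-19, and so on.  Every tuple of a given pass-0 batch lands
 * in that batch's own FANOUT children in pass 1, so children of different
 * parents never overlap and at most FANOUT files are open at a time.
 */
static inline uint32_t
partition_for_pass(uint32_t hashvalue, int pass)
{
    return (hashvalue >> (pass * 10)) & (FANOUT - 1);
}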

I think this is because we compute the batch number as:

*batchno = pg_rotate_right32(hashvalue, hashtable->log2_nbuckets) &
(nbatch - 1);

And tuples can only spill forward. I think Robert's example was that if
we plan for 64 batches and eventually increase to 256 batches, a tuple
assigned to batch 1 could only move to batch 65, 129, or 193 and no
other batch -- meaning we would only need 3 files open while processing
batch 1. But I think we would need to do more explicit file flushing,
closing, and reopening, right? Though maybe that doesn't matter compared
to the overhead of keeping so many more buffers around.
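
For what it's worth, here's a tiny standalone illustration of that
property (nothing PostgreSQL-specific, just the nbatch - 1 masking):

#include <stdio.h>
#include <stdint.h>

int
main(void)
{
    uint32_t old_nbatch = 64;
    uint32_t new_nbatch = 256;
    uint32_t old_batch = 1;

    /*
     * With batchno = hash & (nbatch - 1), a hash value lands in new batch
     * b only if b's low bits still select the old batch under the old
     * mask, so a tuple from batch 1 can only end up in batches 1, 65,
     * 129, or 193 (the ones still mapping to batch 1 simply stay put,
     * hence only 3 spill files).
     */
    for (uint32_t b = 0; b < new_nbatch; b++)
    {
        if ((b & (old_nbatch - 1)) == old_batch)
            printf("old batch %u can map to new batch %u\n", old_batch, b);
    }
    return 0;
}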

> Of course, this is a more complex change than the "balancing" patch. But
> maybe not that much, not sure. For me the main disadvantage is it
> doesn't really help with the batch explosion for skewed data sets (or
> data with many hash collisions). It can easily happen that we blindly
> increase nbatch until we use all the bits, and then break the work_mem
> limit anyway.
>
> But maybe there's a way to address that - the growthEnabled=false safety
> is an unreliable solution, because it requires the whole batch to fall
> into just one of the two new batches. A single tuple in the other batch
> breaks that.
>
> What if we instead compared the two new batches and looked at how far
> the split is from 1/2? If it's very far from 1/2, we'd either increase
> work_mem (a bit like the balancing patch), or disable nbatch increases
> (maybe just temporarily).

Meaning have some threshold for how far over the limit we are? Right
now, we decide to increase the number of batches as soon as we
encounter the one tuple that puts us over the limit. So I could see it
making sense to decide with more foresight. Or we could even keep track
of how far over the limit we are and only increase the number of
batches once we hit that threshold.
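
Purely to illustrate what I mean -- a hypothetical sketch, where the
struct and field names (e.g. slack_factor) are made up and not from
nodeHash.c:

#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping; not actual HashJoinTable fields. */
typedef struct HashMemoryState
{
    size_t  space_used;     /* bytes currently in the hash table */
    size_t  space_allowed;  /* the work_mem-derived budget */
    double  slack_factor;   /* e.g. 1.10 = tolerate a 10% overrun */
} HashMemoryState;

/*
 * Instead of doubling nbatch the moment one tuple pushes us past the
 * budget, only do it once we've exceeded the budget by the slack factor.
 */
static bool
should_increase_nbatch(const HashMemoryState *st)
{
    return st->space_used > (size_t) (st->space_allowed * st->slack_factor);
}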

This kind of seems like it would circle back to your algorithm for
deciding on the right tradeoff between hashtable size and number of
batches, though.

You could do something like this _and_ also close the files that can't
be the target of tuples from the current batch -- which would allow you
to tolerate many more batch increases before doubling the hashtable
size becomes worthwhile. But it seems like the algorithm to adapt the
hashtable size based on the optimal tradeoff between hashtable size and
number of batches could be done first, and the patch to close files
could be done later.

- Melanie
