Re: Adjusting hash join memory limit to handle batch explosion

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Adjusting hash join memory limit to handle batch explosion
Date: 2025-01-06 21:44:12
Message-ID: f6dcdf97-14c1-4578-9d66-8d0d5d772e0f@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 1/6/25 19:50, Robert Haas wrote:
> On Mon, Jan 6, 2025 at 11:51 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>> I wonder if maybe a better solution would be to allow BufFiles with
>> smaller buffers, not just hard-coded 8kB. OTOH I'm not sure how much
>> that helps, before the buffering stops being effective as the buffer
>> gets smaller. I mean, we only have 8kB buffer, so if we cut the buffer
>> in half for every nbatch doubling, we'd be down to 1B after 13 rounds
>> (but the buffer is useless once it gets too small to hold multiple
>> tuples, it's only like 5 cycles).
>
> I was more thinking about whether we need to keep all of those buffers
> around all the time. If the number of batches doesn't increase, then
> after we finish moving things into batches they should never need to
> be moved into a different batch. If it does, then things are
> different, but for example if we initially plan on 64 batches and then
> later decide we need 256 batches, we should really only need 3 buffers
> at a time, except for the initial work during batch 0. (In this
> example, a tuple that is initially assigned to batch 1 might need to
> be moved to batch 65, 129, or 193, but it can't need to go anywhere
> else.)
>

Right.

> But I don't quite know how we could avoid needing all the buffers at
> once during batch 0. That said, it's questionable whether it really
> make sense to have an initial number of batches that is very large.
> Does partitioning the input data into 64k batches really make sense,
> or would it be more efficient to partition it 256 ways initially and
> then do a second pass over each of those to split them up another 256
> ways? It's a lot more I/O, but trying to split 64k ways at once is
> presumably going to thrash the File table as well as do a lot of
> completely random physical I/O, so maybe it's worth considering.
>

True, but as soon as you limit the number of batches you could generate,
it's that more or less the same as not enforcing the limit on the amount
of memory consumed by the hash table? Because you have to keep the
tuples that belong to the current batch in memory ...

I suppose you could do this recursively, i.e. split to 256 batches, and
once you can keep the current batch in memory, spill it to disk too. And
then read it from file, and split it into 256 more batches. I think we'd
need to remember the minimum nbatch value for each batch (when it was
created), and then go through all the stages up to current nbatch. But
it could work, I guess.

The thing is - I don't think increasing the work_mem is bad - in fact,
it's exactly the thing that may stop the batch explosion when there are
hash collisions / overlaps. That's what the test script does to trigger
the explosion, although I admit it's an artificial / adversary case. But
similar stuff can happen in PROD, and we blindly increase nbatch when it
can't possibly help, stopping only after running out of hash bits.

> Another thought is that, if we really do want to partition 64k ways
> all at once, maybe 16kb set aside for each batch is not the right
> approach. 64k batches * 16kB/buffer = 1GB, but if we have 1GB of
> memory available for partitioning, wouldn't it make sense to read a
> gigabyte of tuples, sort them by batch #, and then open each file that
> needs to get at least 1 tuple, write all the tuples into that file,
> and close it? This seems more scalable than what we do today, because
> it doesn't require a certain amount of memory per batch. The
> performance might not be great if you're using a really small amount
> of memory for a really large number of batches, but it might still be
> better than the current algorithm, which could easily leave a lot of
> that memory idling a lot of the time.
>

This is pretty much the idea behind the 0002 patch in the "raw files"
PoC patch, although I tried to use a much smaller batch. Maybe with 1GB
(and better coding than in my PoC patch) it would work better.

Still, if we have 1GB for a buffer, maybe it'd be better to use some of
that for a larger hash table, and not need that many batches ...

> Said another way, I think the current algorithm is optimized for small
> numbers of batches. Repeatedly filling and flushing a 16kB buffer
> makes sense if the number of buffers isn't that big so that flushes
> are regular and a buffer is typically going to spend a lot of its time
> approximately half full. But when the number of batches becomes large,
> buffers will start to be flushed less and less often, especially if
> there is skew in the data but to some degree even if there isn't. Any
> buffer that sits there for "a long time" -- whatever that means
> exactly -- without getting flushed is not a good use of memory.
>

Right. FWIW I suspect we had similar discussions for the hashagg.

> I'm just spitballing here. Don't confuse anything in this email with a
> demand for you to do something different than you are.
>

No, thanks. It's good to have these discussions and be forced to think
about it from a different angle.

regards

--
Tomas Vondra

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2025-01-06 21:50:24 Re: allow changing autovacuum_max_workers without restarting
Previous Message Andres Freund 2025-01-06 21:40:26 Re: AIO v2.2