Re: Adjusting hash join memory limit to handle batch explosion

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Adjusting hash join memory limit to handle batch explosion
Date: 2025-01-09 23:26:01
Message-ID: 6744911e-0d68-4cf5-9bf9-3dd04d1925f5@vondra.me

On 1/9/25 23:18, Melanie Plageman wrote:
> On Sun, Jan 5, 2025 at 10:00 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>>
>> I think the general idea and formula explained in [1] is right, but
>> while working on the PoC patch I started to think about how to formalize
>> this. And I ended up creating two tables that I think visualize it
>> pretty nicely.
>>
>> Imagine a table (in the spreadsheet sense), with work_mem values in rows
>> and nbatch values in columns. And the cell is "total memory" used to
>> execute such hash join, i.e.
>>
>> work_mem + (2 * nbatches * 8K)
>>
>> (Yes, there's a multiplier for the hash table size, but I use work_mem
>> for simplicity.) This is what the two attached PDF files show,
>> highlighting two interesting patterns, so let's talk about that.
>>
>>
>> 1) hash-memory-model-1.pdf
>>
>> Imagine you're executing a hash join - you're in a particular cell of
>> the table. And we've reached the current memory limit, i.e. we've filled
>> the hash table, and need to do something. The only solution is to
>> "expand" the expected "total hash size" (nbatch * hash_table_size),
>> which we do by simply doubling the number of batches. And often that's
>> the right thing to do.
>>
>> For example, let's say we're running with work_mem=4MB and nbatch=16,
>> we've filled the hash table and are using 4336kB of memory (a little bit
>> more than work_mem). If we double the number of batches, we may use up
>> to 4352kB of memory in the next cycle. And that's fine.
>
> If you double the number of batches, isn't that an additional 32 files
> with 8kB each -- so 256kB more memory (not 16 kB)?
>
Right, I think that's a typo in my message - not sure where I got the
4352kB. The table has the correct value, 4592kB.
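
Just to make the arithmetic easy to check, here is roughly how the cells
in the table are computed - a trivial sketch, assuming 2 * (nbatch - 1)
batch files (no files for batch 0) with an 8kB buffer each. That's just
the simplified model, not the exact executor accounting:

#include <stdio.h>

/* total memory for one cell of the table, in kB */
static long
total_kb(long work_mem_kb, long nbatch)
{
    /*
     * hash table (~work_mem) plus an 8kB buffer for each inner and
     * outer batch file, assuming batch 0 needs no files
     */
    return work_mem_kb + 2 * (nbatch - 1) * 8;
}

int
main(void)
{
    printf("%ld\n", total_kb(4096, 16));    /* 4336 kB */
    printf("%ld\n", total_kb(4096, 32));    /* 4592 kB */
    printf("%ld\n", total_kb(8192, 16));    /* 8432 kB */
    printf("%ld\n", total_kb(4096, 8192));  /* ~135 MB */
    printf("%ld\n", total_kb(8192, 8192));  /* ~139 MB */
    return 0;
}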

>> But hey, there's another way to double the "total hash size" - allowing
>> the in-memory hash table to be twice as large. In the above case, that
>> would be wrong, because doubling work_mem would use up to 8432kB.
>>
>> So in this case it's clearly right to double the number of batches,
>> because that minimizes the total memory used in the next step.
>>
>> However, consider for example the cell with work_mem=4MB, nbatch=8192.
>> We're using 135MB of memory, and need to decide what to do. Doubling the
>> batches means we'll use up to 266MB. But doubling work_mem increases the
>> memory use only to 139MB.
>
> Right, it makes sense to use this as the basis for deciding whether or
> not to increase nbatches.
>
>> This is what the green/red in the table means. Green means "better to
>> double nbatch" while red is "better to double work_mem". And clearly,
>> the table is split into two regions, separated by the diagonal.
>>
>> The diagonal is the "optimal" path - if you start in any cell, the
>> red/green decisions will get you to the diagonal, and then along it.
>>
>> The patch [1] aims to do this, but I think this visual explanation is
>> much clearer than anything in that patch.
>
> Yes, the visual is great, thanks!
>

Glad you find it useful too.
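
FWIW the rule can be written down in a few lines - a rough sketch that
compares the worst-case totals after each option, using the same
simplified memory model as above (not the executor's actual accounting):

#include <stdbool.h>

/*
 * true  -> double nbatch (the "green" cells)
 * false -> double the in-memory hash table limit (the "red" cells)
 */
static bool
should_double_nbatch(long work_mem_kb, long nbatch)
{
    long    total_more_batches = work_mem_kb + 2 * (2 * nbatch - 1) * 8;
    long    total_more_memory = 2 * work_mem_kb + 2 * (nbatch - 1) * 8;

    return (total_more_batches <= total_more_memory);
}

If I did the algebra right, this boils down to doubling nbatch only while
the batch file buffers (2 * nbatch * 8kB) stay below work_mem, which
should be exactly the diagonal.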

>> 2) hash-memory-model-2.pdf
>>
>> I've also asked if maybe the patch should do something about the choice
>> of initial nbatch value, which gets me to the second PDF.
>>
>> Imagine we know the total amount of data in the Hash node is 1GB. There
>> are different ways to split that into batches. If we have enough memory,
>> we could do hash join without batching. With work_mem=1MB we'll need to
>> split this into 1024 batches, or we might do work_mem=2MB with only 512
>> batches. And so on - we're moving along the anti-diagonal.
>>
>> The point is that while this changes the work_mem, it can have a pretty
>> non-intuitive impact on total memory use. For example, with wm=1MB we
>> actually use 17MB of memory, while with wm=2MB we use only 10MB.
>>
>> But each anti-diagonal has a minimum - the value on the diagonal. I
>> believe this is the "optimal starting cell" for the hash join. If we
>> don't pick this, the rules explained in (1) will eventually get us to
>> the diagonal anyway.
>
> Makes sense.
>
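
Right. To illustrate the idea with the simplified model: for a fixed
total hash size S, moving along the anti-diagonal means the total is
roughly S/nbatch + 2 * nbatch * 8kB, which (treated continuously) has its
minimum around nbatch = sqrt(S / 16kB). A toy sketch of picking the
starting cell - just the idea, not what the patch literally does:

#include <math.h>

/*
 * Toy illustration: pick the starting nbatch for a given total hash
 * size (in kB) by minimizing  size/nbatch + 2 * nbatch * 8kB  and
 * rounding up to a power of two.
 */
static long
choose_initial_nbatch(long total_size_kb)
{
    double  optimum = sqrt((double) total_size_kb / 16.0);
    long    nbatch = 1;

    while (nbatch < optimum)
        nbatch *= 2;

    return nbatch;
}

For the 1GB example this gives nbatch = 256, i.e. work_mem = 4MB and
about 8MB in total - the minimum of that anti-diagonal, if I've done the
math right.
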
>> Attached is an "adjust-size" patch implementing this. In the end it has
>> pretty much the same effect as the patch in [1], except that it's much
>> simpler - everything important happens in just two simple blocks, one in
>> ExecChooseHashTableSize(), the other in ExecHashIncreaseNumBatches().
>
> It's interesting -- since the new patch no longer needs to count
> buffile overhead in spaceUsed, spacePeak won't include that overhead.
> And ultimately EXPLAIN uses the spacePeak, right?
>

Right, that's a good point - it was actually helpful that the initial
patch made this extra memory visible in EXPLAIN. But without the changes
to spacePeak that's no longer the case, so maybe we should add a separate
field or something like that ...

>> There's a bit of complexity, because if we allow growing the size of the
>> in-memory hash table, we probably need to allow increasing the number of
>> buckets. But that's not possible with how we split the hashvalue now, so
>> the patch adjusts that by reversing the hashvalue bits when calculating
>> the batch. I'm not sure if this is the best way to do this; there might
>> well be a better solution.
>
> This part is pretty unpleasant looking (reverse_byte array in the
> code). I'll try and think of different ideas. However, I wonder what
> other kinds of effects allowing increasing the number of buckets
> during execution might have?
>

Agreed. It's simply the simplest approach I found to make the hashing
work - I haven't even tried to measure the overhead. I was looking for
some built-in function to reverse bits etc., but found nothing.
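
For the record, the usual bit-twiddling alternative to a lookup table
would be something like this (plain C sketch using uint32_t rather than
our uint32; no idea yet whether it's faster or slower than reverse_byte,
I haven't measured either):

#include <stdint.h>

/* reverse the bits of a 32-bit value (classic shift-and-mask approach) */
static inline uint32_t
reverse_bits32(uint32_t x)
{
    x = ((x & 0x55555555) << 1) | ((x >> 1) & 0x55555555);   /* bits    */
    x = ((x & 0x33333333) << 2) | ((x >> 2) & 0x33333333);   /* pairs   */
    x = ((x & 0x0F0F0F0F) << 4) | ((x >> 4) & 0x0F0F0F0F);   /* nibbles */
    x = (x << 24) | ((x & 0xFF00) << 8) |
        ((x >> 8) & 0xFF00) | (x >> 24);                      /* bytes   */
    return x;
}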

>> I admit all of this seemed a bit weird / wrong initially, because it
>> feels like giving up the memory limit. But the simple truth is that the
>> memory limit is pretty much just a lie - the fact that we only show the
>> hash table size in EXPLAIN does not mean we're not using gigabytes more
>> memory, we're just not making it clear. So I'd argue this actually does
>> a better job in limiting memory usage.
>
> I guess people can multiply the number of batches * 8kB to get that
> extra memory overhead. Maybe we should consider putting that in
> EXPLAIN output?
>

Exactly what I suggested above (to add that to EXPLAIN).

Expecting people to realize the batches are backed by batch files and
multiply the number by 8kB didn't quite work, I think. People don't
realize each file has an 8kB buffer, and even experienced users/hackers are
surprised by how much memory it quietly consumes.

>> When thinking about reasons why doubling the work_mem might not be the
>> right thing, I can think of one case - CPU caches. IIRC it may be much
>> faster to do the lookups if the hash is small enough to fit into L3, and
>> doubling this might work against the goal, although I'm not sure how
>> bad the impact may be. In the batch explosion case it surely doesn't
>> matter - the cost of spilling/loading many files is much higher. But for
>> regular (well estimated) cases it might have a negative impact.
>>
>> This is why the patch only adjusts the initial parameters in the "red"
>> area, not in the green. Maybe it should be a bit more conservative and
>> only kick in when the nbatch value is above some threshold.
>
> Wait, isn't that the opposite of what you are saying? That is, if we
> want to keep the hashtable fitting in L3, wouldn't we want to allow
> increasing the number of batches even if it uses more memory? That is
> the green area. I see your patch does the red -- increase hashtable
> size and decrease nbatches if it is better. But that seems
> inconsistent with your point about making the hashtable fit in L3.
>

You're right. The "red" area means that we double work_mem, so the hash
table would probably exceed L3. I don't recall what exactly my reasoning
was - maybe I just didn't think it through.

But I think the L3 benefit likely disappears once we exceed some number
of batches, because batching is fairly expensive. (I haven't measured
this yet, but I find it likely.) That'd mean having some threshold (e.g.
1024 batches), and only applying this new balancing when we exceed it,
would be reasonable.
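
So something along these lines, with the threshold value entirely made up
and the names just placeholders (reusing the should_double_nbatch()
sketch from above):

/* made-up threshold, would need benchmarking */
#define BALANCE_MIN_NBATCH      1024

/*
 * Below the threshold keep the current behavior (always double nbatch),
 * above it fall back to comparing the two options.
 */
static bool
grow_nbatch(long work_mem_kb, long nbatch)
{
    if (nbatch < BALANCE_MIN_NBATCH)
        return true;
    return should_double_nbatch(work_mem_kb, nbatch);
}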

>> I'd appreciate opinions and alternative ideas about this.
>
> I really like the overall idea of being principled about the number of
> batches vs hashtable size. I think the question of how to increase the
> number of buckets during execution is important, and we need to figure
> out a good way of doing it.
>

Agreed.

regards

--
Tomas Vondra
