From: Melanie Plageman <melanieplageman(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Adjusting hash join memory limit to handle batch explosion
Date: 2025-01-09 22:18:49
Message-ID: CAAKRu_aYJ9gcaZWOSkfarRmq7Y7Z6ppJSpTym=pKTDP-M3LNFw@mail.gmail.com
Lists: pgsql-hackers

On Sun, Jan 5, 2025 at 10:00 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> I think the general idea and formula explained in [1] are right, but
> while working on the PoC patch I started to think about how to formalize
> this. And I ended up creating two tables that I think visualize it
> pretty nicely.
>
> Imagine a table (in the spreadsheet sense), with work_mem values in rows
> and nbatch values in columns. And the cell is "total memory" used to
> execute such hash join, i.e.
>
> work_mem + (2 * nbatches * 8K)
>
> (Yes, there's a multiplier for the hash table size, but I use work_mem
> for simplicity.) This is what the two attached PDF files show,
> highlighting two interesting patterns, so let's talk about that.
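Just to make sure I'm computing the cells the same way, this is the
formula I've been using (a quick sketch, the names are mine and the
units are kB, nothing from the patch):

    /*
     * Total memory for one cell of the table: the in-memory hash table
     * (approximated by work_mem) plus two ~8kB BufFile buffers per batch.
     */
    static size_t
    total_hash_memory_kb(size_t hash_table_kb, int nbatch)
    {
        return hash_table_kb + (size_t) 2 * nbatch * 8;
    }

For example, total_hash_memory_kb(1024, 1024) = 1024 + 16384 = 17408kB,
which lines up with the ~17MB you mention below for the wm=1MB row.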
>
>
> 1) hash-memory-model-1.pdf
>
> Imagine you're executing a hash join - you're in a particular cell of
> the table. And we've reached the current memory limit, i.e. we've filled
> the hash table, and need to do something. The only solution is to
> "expand" the expected "total hash size" (nbatch * hash_table_size),
> which we do by simply doubling the number of batches. And often that's
> the right thing to do.
>
> For example, let's say we're running with work_mem=4MB and nbatch=16,
> we've filled the hash table and are using 4336kB of memory (a little bit
> more than work_mem). If we double the number of batches, we may use up
> to 4352kB of memory in the next cycle. And that's fine.
If you double the number of batches, isn't that an additional 32 files
with 8kB each -- so 256kB more memory (not 16 kB)?
> But hey, there's another way to double the "total hash size" - allowing
> the in-memory hash table to be twice as large. In the above case, that
> would be wrong, because doubling work_mem would use up to 8432kB.
>
> So in this case it's clearly right to double the number of batches,
> because that minimizes the total memory used in the next step.
>
> However, consider for example the cell with work_mem=4MB, nbatch=8192.
> We're using 135MB of memory, and need to decide what to do. Doubling the
> batches means we'll use up to 266MB. But doubling work_mem increases the
> memory use only to 139MB.
Right, it makes sense to use this as the basis for deciding whether or
not to increase nbatches.
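IIUC the rule boils down to something like this (hypothetical names, not
the actual patch code; sizes in kB, assuming the usual Postgres
typedefs):

    /*
     * Double whichever knob gives the smaller projected total memory for
     * the next cycle: true means "double nbatch" (the green cells), false
     * means "double the in-memory hash table" (the red cells).
     */
    static bool
    should_double_nbatch(size_t hash_table_kb, int nbatch)
    {
        size_t      grow_batches = hash_table_kb + (size_t) 2 * (nbatch * 2) * 8;
        size_t      grow_hashtable = (hash_table_kb * 2) + (size_t) 2 * nbatch * 8;

        return grow_batches <= grow_hashtable;
    }

For the wm=4MB / nbatch=8192 cell that's 266240kB vs 139264kB, so it
comes out red, which matches your numbers above.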
> This is what the green/red in the table means. Green means "better to
> double nbatch" while red is "better to double work_mem". And clearly,
> the table is split into two regions, separated by the diagonal.
>
> The diagonal is the "optimal" path - if you start in any cell, the
> red/green decisions will get you to the diagonal, and then along it.
>
> The patch [1] aims to do this, but I think this visual explanation is
> much clearer than anything in that patch.
Yes, the visual is great, thanks!
> 2) hash-memory-model-2.pdf
>
> I've also asked if maybe the patch should do something about the choice
> of initial nbatch value, which gets me to the second PDF.
>
> Imagine we know the total amount of data in the Hash node is 1GB. There
> are different ways to split that into batches. If we have enough memory,
> we could do hash join without batching. With work_mem=1MB we'll need to
> split this into 1024 batches, or we might do work_mem=2MB with only 512
> batches. And so on - we're moving along the anti-diagonal.
>
> The point is that while this changes the work_mem, it can have a pretty
> non-intuitive impact on total memory use. For example, with wm=1MB we
> actually use 17MB of memory, while with wm=2MB we use only 10MB.
>
> But each anti-diagonal has a minimum - the value on the diagonal. I
> believe this is the "optimal starting cell" for the hash join. If we
> don't pick this, the rules explained in (1) will eventually get us to
> the diagonal anyway.
Makes sense.
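And for the initial sizing, I imagine it's essentially walking the
anti-diagonal and keeping the minimum, something like the sketch below
(again hypothetical names, kB units, nbatch kept a power of two):

    static void
    choose_initial_split(size_t inner_kb, size_t min_space_kb,
                         size_t *space_kb, int *nbatch)
    {
        size_t      space = min_space_kb;
        size_t      best_total = (size_t) -1;
        int         batches = 1;

        /* smallest power-of-two batch count that fits inner_kb */
        while ((size_t) batches * space < inner_kb)
            batches *= 2;

        /* walk the anti-diagonal: double space, halve batches */
        while (batches >= 1)
        {
            size_t      total = space + (size_t) 2 * batches * 8;

            if (total < best_total)
            {
                best_total = total;
                *space_kb = space;
                *nbatch = batches;
            }
            space *= 2;
            batches /= 2;
        }
    }

For your 1GB example starting at wm=1MB that walks 17MB, 10MB, 8MB,
10MB, ... and settles on roughly wm=4MB / nbatch=256, which I assume is
the diagonal cell you mean.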
> Attached is an "adjust-size" patch implementing this. In the end it has
> pretty much the same effect as the patch in [1], except that it's much
> simpler - everything important happens in just two simple blocks, one in
> ExecChooseHashTableSize(), the other in ExecHashIncreaseNumBatches().
It's interesting -- since the new patch no longer needs to count
buffile overhead in spaceUsed, spacePeak won't include that overhead.
And ultimately EXPLAIN uses the spacePeak, right?
> There's a bit of complexity, because if we allow growing the size of the
> in-memory hash table, we probably need to allow increasing the number of
> buckets. But that's not possible with how we split the hashvalue now, so
> the patch adjusts that by reversing the hashvalue bits when calculating
> the batch. I'm not sure if this is the best way to do this, there might
> well be a better solution.
This part is pretty unpleasant looking (the reverse_byte array in the
code). I'll try to think of different ideas. However, I wonder what
other effects allowing the number of buckets to increase during
execution might have?
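For the record, my understanding of the trick is that it's equivalent to
taking the batch number from the top bits of the hashvalue, so the
bucket number can keep eating low bits as nbuckets grows. Roughly this
shape, using the usual swap-based reversal instead of a lookup table
(just a sketch of what I have in mind, not the patch's code):

    static inline uint32
    reverse_bits32(uint32 x)
    {
        x = ((x >> 1) & 0x55555555) | ((x & 0x55555555) << 1);
        x = ((x >> 2) & 0x33333333) | ((x & 0x33333333) << 2);
        x = ((x >> 4) & 0x0F0F0F0F) | ((x & 0x0F0F0F0F) << 4);
        return (x << 24) | ((x & 0xFF00) << 8) | ((x >> 8) & 0xFF00) | (x >> 24);
    }

    /* nbuckets and nbatch are both powers of two */
    static inline void
    get_bucket_and_batch(uint32 hashvalue, int nbuckets, int nbatch,
                         int *bucketno, int *batchno)
    {
        /* buckets come from the low bits, batches from the (reversed) high bits */
        *bucketno = (int) (hashvalue & (nbuckets - 1));
        *batchno = (int) (reverse_bits32(hashvalue) & (nbatch - 1));
    }

If that's right, doubling nbatch still either keeps a tuple in its batch
or moves it to batchno + old_nbatch, and growing nbuckets only consumes
more low bits, so the two shouldn't collide as long as log2(nbuckets) +
log2(nbatch) <= 32 -- but I haven't verified that against the patch.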
> I admit all of this seemed a bit weird / wrong initially, because it
> feels like giving up the memory limit. But the simple truth is that the
> memory limit is pretty much just a lie - the fact that we only show the
> hash table size in EXPLAIN does not mean we're not using gigabytes more
> memory; we're just not making it clear. So I'd argue this actually does
> a better job in limiting memory usage.
I guess people can multiply the number of batches by 2 x 8kB to get that
extra memory overhead (e.g. 8192 batches is another 8192 * 2 * 8kB of
BufFile buffers). Maybe we should consider putting that in EXPLAIN
output?
> When thinking about reasons why doubling the work_mem might not be the
> right thing, I can think of one case - CPU caches. IIRC it may be much
> faster to do the lookups if the hash table is small enough to fit into L3,
> and doubling this might work against that goal, although I'm not sure how
> bad the impact may be. In the batch explosion case it surely doesn't
> matter - the cost of spilling/loading many files is much higher. But for
> regular (well estimated) cases it might have negative impact.
>
> This is why the patch only adjusts the initial parameters in the "red"
> area, not in the green. Maybe it should be a bit more conservative and
> only kick in when the nbatch value is above some threshold.
Wait, isn't that the opposite of what you are saying? That is, if we
want to keep the hashtable fitting in L3, wouldn't we want to allow
increasing the number of batches even if it uses more memory? That is
the green area. I see your patch does the red -- increase hashtable
size and decrease nbatches if it is better. But that seems
inconsistent with your point about making the hashtable fit in L3.
> I'd appreciate opinions and alternative ideas about this.
I really like the overall idea of being principled about the tradeoff
between the number of batches and the hashtable size. I think the open
question is how to increase the number of buckets during execution, and
it's important that we figure out a good way of doing that.
- Melanie