From: Tomas Vondra <tomas(at)vondra(dot)me>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Adjusting hash join memory limit to handle batch explosion
Date: 2025-01-06 02:59:39
Message-ID: 9e01b538-fb62-4386-b703-548818911702@vondra.me
Lists: pgsql-hackers
Hi,
I kept thinking about this, thinking about alternative approaches, and
also about how hashjoin limits memory in general.
First, I want to discuss one thing I tried, but I think it does not
really work. The annoying part about the "memory rebalance" patch is
that it relaxes the memory limit. However, the memory limit is a lie,
because we enforce it by adding batches, and that unfortunately is not
free - each batch is a BufFile with BLCKSZ buffer, so while we may
succeed in keeping the hash table in work_mem, we may end up with the
batches using gigabytes of memory - which is not reported in EXPLAIN,
but it's still allocated. This happens even without a batch explosion; the
explosion is just an extreme version of it.
There's no way to work around this ... as long as we use BufFiles. What
if we used plain File(s), without the buffering? Then the per-batch cost
would be considerably lower. Of course, this would only be acceptable if
dropping the buffering does not hurt performance too much. That
seemed unlikely, but I decided to give it a try - see a PoC of this in
the attached files-poc-patches.tgz (patch 0001).
Unfortunately, the impact does not seem acceptable - it does enforce the
limit, but the lack of buffering makes a huge difference, slowing things
down by up to ~2x depending on the query.
I experimented a bit with a cross-file buffer, much smaller than the sum
of BufFile buffers, but still allowing combining writes into larger
chunks. Imagine you have a buffer large enough for (nbatch * 2) tuples,
then we may expect writing two tuples at once into each batch file.
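To illustrate the idea (a minimal sketch with made-up names, not the actual
0002 code): each batch gets a small slot in one shared buffer, tuples are
appended to the slot, and we only call write() on the batch file once the
slot fills up.

/* Minimal sketch of per-batch write combining (hypothetical, not 0002). */
#include <string.h>
#include <unistd.h>

#define SLOT_SIZE 256           /* bytes buffered per batch before flushing */

typedef struct BatchSlot
{
    int     fd;                 /* batch file descriptor */
    int     used;               /* bytes currently buffered */
    char    data[SLOT_SIZE];
} BatchSlot;

/*
 * Append one tuple to a batch, issuing a write() only when the slot
 * fills up.  Error handling omitted for brevity.
 */
static void
batch_slot_append(BatchSlot *slot, const char *tuple, int len)
{
    if (slot->used + len > SLOT_SIZE)
    {
        (void) write(slot->fd, slot->data, slot->used);
        slot->used = 0;
    }
    if (len > SLOT_SIZE)
    {
        /* oversized tuple - write it directly, bypassing the slot */
        (void) write(slot->fd, tuple, len);
        return;
    }
    memcpy(slot->data + slot->used, tuple, len);
    slot->used += len;
}

Of course the array of slots has nbatch entries, which is exactly where the
problems described below come from.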
The 0002 patch in the PoC series tries to do this, but it does not
really help, and in some cases it's doing even worse than 0001, because
the cost of maintaining the shared buffer increases with the number of
batches. I'm sure some of this is my fault, and it could be improved and
optimized quite a bit.
I decided not to do that, because this experiment made me realize that:
a) The buffer would need to grow with the number of batches, to have any
chance of combining the writes. If we want to combine K tuples into a
single write (on average), we'd need the buffer to keep (nbatches * K)
tuples, and once nbatch gets large (because who cares about
BufFile allocations with a handful of batches), that may be a lot of
memory. Say we get to 32k batches (which is not that hard), and we want
to keep 16 tuples, each 128B, that's ~64MB. Not a huge amount, and much
less than the 512MB we'd need for batches. But still, it means we're not
really enforcing the memory limit - which was the point of using files
without buffering.
b) It does enforce the limit on the hash table itself, though. And that
is actually not great, because it means it can't possibly help with the
batch explosion, caused by a single "indivisible" batch.
c) It's pretty invasive.
So I still think adjusting the memory limit as we add batches is the better
approach. The question is where to do the adjustments, and based on what
logic ...
I think the general idea and formula explained in [1] is right, but
while working on the PoC patch I started to think about how to formalize
this. And I ended up creating two tables that I think visualize it
pretty nicely.
Imagine a table (in the spreadsheet sense), with work_mem values in rows
and nbatch values in columns. Each cell is the "total memory" used to
execute such a hash join, i.e.
work_mem + (2 * nbatches * 8K)
(Yes, there's a multiplier for the hash table size, but I use work_mem
for simplicity.) This is what the two attached PDF files show,
highlighting two interesting patterns, so let's talk about that.
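(For reference, the cells are literally just that formula evaluated over a
grid. A trivial sketch that prints the raw numbers, without the coloring,
with sizes in kB and assuming two 8kB BufFiles per batch:)

/* Sketch: print the "total memory" grid (kB), total = work_mem + 2 * nbatch * 8kB. */
#include <stdio.h>

int
main(void)
{
    printf("%10s", "wm \\ nb");
    for (long nbatch = 1; nbatch <= 32768; nbatch *= 2)
        printf("%9ld", nbatch);
    printf("\n");

    for (long work_mem = 1024; work_mem <= 1024 * 1024; work_mem *= 2)
    {
        printf("%10ld", work_mem);
        for (long nbatch = 1; nbatch <= 32768; nbatch *= 2)
            printf("%9ld", work_mem + 2 * nbatch * 8);
        printf("\n");
    }
    return 0;
}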
1) hash-memory-model-1.pdf
Imagine you're executing a hash join - you're in a particular cell of
the table. And we've reached the current memory limit, i.e. we've filled
the hash table, and need to do something. The only solution is to
"expand" the expected "total hash size" (nbatch * hash_table_size),
which we do by simply doubling the number of batches. And often that's
the right thing to do.
For example, let's say we're running with work_mem=4MB and nbatch=16,
we've filled the hash table and are using 4336kB of memory (a little bit
more than work_mem). If we double the number of batches, we may use up
to 4352kB of memory in the next cycle. And that's fine.
But hey, there's another way to double the "total hash size" - allowing
the in-memory hash table to be twice as large. In the above case, that
would be wrong, because doubling work_mem would use up to 8432kB.
So in this case it's clearly right to double the number of batches,
because that minimizes the total memory used in the next step.
However, consider for example the cell with work_mem=4MB, nbatch=8192.
We're using 135MB of memory, and need to decide what to do. Doubling the
batches means we'll use up to 266MB. But doubling work_mem increases the
memory use only to 139MB.
This is what the green/red in the table means. Green means "better to
double nbatch" while red is "better to double work_mem". And clearly,
the table is split into two regions, separated by the diagonal.
The diagonal is the "optimal" path - if you start in any cell, the
red/green decisions will get you to the diagonal, and then along it.
The patch [1] aims to do this, but I think this visual explanation is
much clearer than anything in that patch.
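Put differently, doubling nbatch adds another (2 * nbatch * 8kB) of BufFile
buffers, while doubling the hash table adds another work_mem, so the
green/red choice is just picking the smaller increment. A sketch of that
rule (simplified, sizes in kB, not the actual patch code):

/*
 * Sketch of the doubling decision (simplified, not the actual patch code).
 * Sizes are in kB.  Doubling nbatch adds 2 * nbatch * 8kB of BufFile
 * buffers, doubling the hash table adds work_mem, so pick the smaller one.
 */
static void
hash_join_grow(long *work_mem, long *nbatch)
{
    long    buffile_kb = 2 * (*nbatch) * 8;     /* inner + outer BufFiles */

    if (buffile_kb < *work_mem)
        *nbatch *= 2;           /* "green" cell: adding batches is cheaper */
    else
        *work_mem *= 2;         /* "red" cell: growing the hash table is cheaper */
}

Repeatedly applying this from any starting cell walks towards the diagonal,
where work_mem and the BufFile overhead are roughly equal, and then along it.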
2) hash-memory-model-2.pdf
I've also asked whether the patch should do something about the choice
of the initial nbatch value, which brings me to the second PDF.
Imagine we know the total amount of data in the Hash node is 1GB. There
are different ways to split that into batches. If we have enough memory,
we could do hash join without batching. With work_mem=1MB we'll need to
split this into 1024 batches, or we might do work_mem=2MB with only 512
batches. And so on - we're moving along the anti-diagonal.
The point is that while this changes the work_mem, this can have pretty
non-intuitive impact on total memory use. For example, with wm=1MB we
actually use 17MB of memory, while with wm=2MB we use only 10MB.
But each anti-diagonal has a minimum - the value on the diagonal. I
believe this is the "optimal starting cell" for the hash join. If we
don't pick this, the rules explained in (1) will eventually get us to
the diagonal anyway.
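In other words, the starting point should minimize
(total_size / nbatch) + 2 * nbatch * 8kB, which happens where the hash
table and the BufFile buffers are about the same size, i.e.
nbatch ~ sqrt(total_size / (2 * 8kB)). A sketch of that choice (sizes in
kB, not the actual patch code):

/*
 * Sketch: choose the initial nbatch minimizing
 *     total = (total_size / nbatch) + 2 * nbatch * 8kB
 * i.e. the cell on the diagonal.  Sizes in kB, nbatch rounded up to
 * a power of two.  Not the actual patch code.
 */
#include <math.h>

static long
initial_nbatch(long total_size_kb)
{
    double  ideal = sqrt(total_size_kb / (2.0 * 8.0));
    long    nbatch = 1;

    while (nbatch < ideal)
        nbatch *= 2;
    return nbatch;
}

For the 1GB example this comes out at nbatch = 256 with a ~4MB hash table,
i.e. ~8MB total, which is the minimum on that anti-diagonal.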
A different visualization is in the attached SVG, which is a surface
plot / heat map of the total memory use. It shows that there really is a
"valley" of minimal values on the diagonal, and that the growth for
doubling batches is much steeper than for doubling work_mem.
Attached is an "adjust-size" patch implementing this. In the end it has
pretty much the same effect as the patch in [1], except that it's much
simpler - everything important happens in just two simple blocks, one in
ExecChooseHashTableSize(), the other in ExecHashIncreaseNumBatches().
There's a bit of complexity, because if we allow growing the size of the
in-memory hash table, we probably need to allow increasing the number of
buckets. But that's not possible with how we split the hashvalue now, so
the patch adjusts that by reversing the hashvalue bits when calculating
the batch. I'm not sure if this is the best way to do this, there might
well be a better solution.
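To make that a bit more concrete, here's roughly what I mean (a simplified
sketch, not necessarily exactly what the patch does): buckets keep using
the low bits of the hashvalue as before, while the batch number is taken
from the reversed value, i.e. from the high bits, so doubling nbuckets no
longer shifts which bits select the batch.

#include <stdint.h>

/* Reverse the bits of a 32-bit hash value (simple, unoptimized). */
static uint32_t
reverse_bits(uint32_t x)
{
    uint32_t    r = 0;

    for (int i = 0; i < 32; i++)
    {
        r = (r << 1) | (x & 1);
        x >>= 1;
    }
    return r;
}

/*
 * Sketch of the bucket/batch split: buckets come from the low bits,
 * batches from the (reversed) high bits.  nbuckets and nbatch are
 * powers of two.  Doubling nbatch keeps a tuple in batch B or moves
 * it to batch (B + nbatch), and doubling nbuckets does not change
 * the batch at all.
 */
static void
get_bucket_and_batch(uint32_t hashvalue, int nbuckets, int nbatch,
                     int *bucketno, int *batchno)
{
    *bucketno = (int) (hashvalue & (uint32_t) (nbuckets - 1));
    *batchno = (int) (reverse_bits(hashvalue) & (uint32_t) (nbatch - 1));
}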
I admit all of this seemed a bit weird / wrong initially, because it
feels like giving up the memory limit. But the simple truth is that the
memory limit is pretty much a lie - the fact that we only show the
hash table size in EXPLAIN does not mean we're not using gigabytes more
memory; we're just not making it visible. So I'd argue this actually does
a better job in limiting memory usage.
When thinking about reasons why doubling the work_mem might not be the
right thing, I can think of one case - CPU caches. IIRC it may be much
faster to do the lookups if the hash table is small enough to fit into L3,
and doubling it might work against that goal, although I'm not sure how
bad the impact may be. In the batch explosion case it surely doesn't
matter - the cost of spilling/loading many files is much higher. But for
regular (well estimated) cases it might have negative impact.
This is why the patch only adjusts the initial parameters in the "red"
area, not in the green. Maybe it should be a bit more conservative and
only kick in when the nbatch value is above some threshold.
I'd appreciate opinions and alternative ideas about this.
I'm also attaching the data + SQL script I use to trigger the batch
explosion with up to 2M batches.
regards
[1]
https://www.postgresql.org/message-id/7bed6c08-72a0-4ab9-a79c-e01fcdd0940f%40vondra.me
--
Tomas Vondra
Attachments:
  vadjust-size-0001-hashjoin-sizing-balance.patch (text/x-patch, 5.3 KB)
  hash-memory-model-1.pdf (application/pdf, 16.6 KB)
  hash-memory-model-2.pdf (application/pdf, 16.9 KB)
  hashjoin.svg (image/svg+xml, 100.5 KB)
  files-poc-patches.tgz (application/x-compressed-tar, 6.0 KB)
  test.sql (application/sql, 1.2 KB)
  hash-collisions.data.gz (application/gzip, 93.2 KB)