Re: Adjusting hash join memory limit to handle batch explosion

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Adjusting hash join memory limit to handle batch explosion
Date: 2025-01-25 20:41:29
Message-ID: 5c6b1a65-6265-4bc5-9d0e-a4c7d6ff3d62@vondra.me
Lists: pgsql-hackers

Hi,

Here's a somewhat cleaned up version of the original patch series (with
memory balancing) from [1].

1) v20250125-0001-Balance-memory-usage-with-hashjoin-batch-e.patch
------------------------------------------------------------------

The 0001 patch does exactly the same thing as
vadjust-size-0001-hashjoin-sizing-balance.patch, except that it moves
the code a bit and (hopefully) does a better job at explaining the logic
in the comments.
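
To recap the core idea from [1]: each batch needs BufFile buffers on both
sides (BLCKSZ bytes each), so with enough batches the batch files alone can
use far more memory than the hash table itself. At that point it's cheaper
to let the hash table grow than to keep doubling nbatch. Roughly, the
balancing looks like this (a simplified sketch with illustrative names, not
the exact code from the patch):

#include <stddef.h>

#define BLCKSZ 8192             /* PostgreSQL's default block size */

/*
 * Simplified sketch of the memory balancing: starting from the nbatch
 * that keeps the hash table within space_allowed, keep halving nbatch
 * (and doubling the allowance) as long as that reduces the total of
 * hash table + per-batch BufFile buffers.
 */
static void
balance_hashjoin_memory(size_t hash_table_bytes, int *nbatch,
                        size_t *space_allowed)
{
    while (*nbatch > 1)
    {
        /* memory used now: hash table + two BufFile buffers per batch */
        size_t      current_space = hash_table_bytes +
            (2 * (size_t) *nbatch * BLCKSZ);

        /* memory used with half the batches and twice the hash table */
        size_t      new_space = (2 * hash_table_bytes) +
            ((size_t) *nbatch * BLCKSZ);

        /* stop once using fewer batches would not reduce total memory */
        if (current_space <= new_space)
            break;

        *nbatch /= 2;
        hash_table_bytes *= 2;
        *space_allowed *= 2;
    }
}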

I'm fairly happy with how simple and non-invasive this is, and how well
it deals with the issue. Sure, limiting the number of batch files (and
then splitting them "recursively" later) seems possible and perhaps more
"correct" (in the sense that it better enforces the memory limit). But
it's far more invasive, impacts everyone (not just the rare case of
batch explosion), and doesn't help with "indivisible" batches (we still
end up with the batch explosion).

I don't have the capacity/interest to continue working on this (limiting
the number of spill files) in the near term, and even if I had, I don't
think it'd be doable for PG18, considering there's just one commitfest
left.

My plan is to get something like 0001 into PG18. It's strictly better
than what we have now, that's for sure, and I think it's good enough for
the rare cases of batch explosion.

The one open question I have is what to do about the hashing, and how we
calculate the bucket/batch numbers. With the current scheme we can't
increase the number of buckets above nbuckets_optimal, which is sized for
the largest hash table that fits within the memory limit (work_mem *
hash_mem_multiplier). But the patch is based on the idea that at some
point it's better to grow the hash table beyond that limit.
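
For reference, this is roughly how the bucket and batch numbers are
derived today (a simplified version of ExecHashGetBucketAndBatch;
nbuckets and nbatch are powers of two):

#include <stdint.h>

static void
get_bucket_and_batch(uint32_t hashvalue,
                     uint32_t nbuckets, int log2_nbuckets, uint32_t nbatch,
                     uint32_t *bucketno, uint32_t *batchno)
{
    /* bucketno: the low log2(nbuckets) bits of the hash */
    *bucketno = hashvalue & (nbuckets - 1);

    /* batchno: the bits immediately above the bucket bits */
    *batchno = (hashvalue >> log2_nbuckets) & (nbatch - 1);
}

That is, growing nbuckets past nbuckets_optimal would mean bucketno
starts reusing bits that batchno has already consumed.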

So either we need to change how we split the hash value into bucket/batch
bits, or just accept that if the hash table grows too much, we may get
longer bucket chains. Which I guess might still be better than having too
many batch files.

The patch reverses the bits of the hash value using the lookup-table
algorithm from here:

https://graphics.stanford.edu/~seander/bithacks.html#BitReverseTable

And then it takes the batch number from the reversed value, i.e. from the
beginning of the hash (while the bucket number is still taken from the
end, as before). I tried the other algorithms, but they all seemed slower.
Another option would be to start calculating two separate hashes, or a
64-bit hash (and split it into two). Not sure.
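
For illustration, here's the lookup-table reversal from that page, and
(roughly) how the patch then computes the two numbers - the exact code in
the patch differs, and the function names here are just illustrative:

#include <stdint.h>

/* lookup-table bit reversal, straight from the bithacks page above */
static const unsigned char BitReverseTable256[256] =
{
#define R2(n)    n,     n + 2*64,     n + 1*64,     n + 3*64
#define R4(n) R2(n), R2(n + 2*16), R2(n + 1*16), R2(n + 3*16)
#define R6(n) R4(n), R4(n + 2*4 ), R4(n + 1*4 ), R4(n + 3*4 )
    R6(0), R6(2), R6(1), R6(3)
};

static inline uint32_t
reverse_bits(uint32_t v)
{
    return ((uint32_t) BitReverseTable256[v & 0xff] << 24) |
           ((uint32_t) BitReverseTable256[(v >> 8) & 0xff] << 16) |
           ((uint32_t) BitReverseTable256[(v >> 16) & 0xff] << 8) |
           ((uint32_t) BitReverseTable256[(v >> 24) & 0xff]);
}

/*
 * batchno now consumes bits from the "top" of the hash while bucketno
 * keeps using the low bits, so the two no longer compete for the same
 * bit range even if nbuckets later grows.
 */
static void
get_bucket_and_batch_reversed(uint32_t hashvalue,
                              uint32_t nbuckets, uint32_t nbatch,
                              uint32_t *bucketno, uint32_t *batchno)
{
    *bucketno = hashvalue & (nbuckets - 1);
    *batchno = reverse_bits(hashvalue) & (nbatch - 1);
}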

2) v20250125-0002-Postpone-hashtable-growth-instead-of-disab.patch
------------------------------------------------------------------

0002 is an experimental patch to handle another failure mode I speculated
about, namely "premature disabling of nbatch growth". The theory was
that if the inner relation is correlated, we may disable nbatch growth
prematurely, because the first nbatch increase happens to not split the
batch at all (because of the correlation).

It turned out to be harder to trigger, because it assumes we actually
start with too few batches - i.e. that we significantly underestimate
the inner relation. I'm sure that can easily happen, but the impact
seems to be less severe than for the batch explosion.

There's a SQL script with an example triggering this in 0003.

The patch simply stops using the "true/false" flag (growEnabled), and
instead just doubles the spaceAllowed threshold (a bit like 0001),
effectively postponing the next round of nbatch doubling.
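
In nodeHash.c terms the change amounts to something like this (a sketch
with names mirroring ExecHashIncreaseNumBatches, not the exact patch):

#include <stddef.h>

/*
 * Master reacts to a useless split (one that moves none, or all, of the
 * tuples) by setting growEnabled = false, permanently.  Doubling the
 * threshold instead merely postpones the next nbatch increase.
 */
static void
after_batch_split(long nfreed, long ninmemory, size_t *spaceAllowed)
{
    if (nfreed == 0 || nfreed == ninmemory)
    {
        /* was: hashtable->growEnabled = false; */
        *spaceAllowed *= 2;
    }
}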

This made me realize we already have an issue with nbuckets sizing - if
we disable nbatch growth (be it forever or just temporarily), we
essentially allow the hash table to exceed the expected size, and thus
nbuckets may be too low. So we'd need to increase the number of buckets
even today; it's not just a matter of the 0001 patch.

3) v20250125-0003-hashjoin-patch-tests.patch
--------------------------------------------

This has some files illustrating the memory usage etc. as explained in
the original message [1]. But it also has three scripts to reproduce the
issues.

- patch/batch-explosion.sql - batch explosion
- patch/disabled-growth.sql - disabled growth / correlated data
- patch/disabled-growth2.sql - disabled growth / hash pattern

To use these scripts, copy hash-collisions.data to /tmp (the scripts load
the data into a table), and then run the script with \i.

Each of the patches has a GUC to enable the new behavior:

- enable_hashjoin_adjust
- enable_hashjoin_growth

Both default to "false", i.e. disabled, so to see the new behavior you
need to explicitly set the relevant GUC to 'true' before running the
script. These GUCs are meant only for easier development and would be
removed from the final commit.

regards

[1]
https://www.postgresql.org/message-id/9e01b538-fb62-4386-b703-548818911702%40vondra.me

--
Tomas Vondra

Attachment Content-Type Size
v20250125-0001-Balance-memory-usage-with-hashjoin-batch-e.patch text/x-patch 14.0 KB
v20250125-0002-Postpone-hashtable-growth-instead-of-disab.patch text/x-patch 5.7 KB
v20250125-0003-hashjoin-patch-tests.patch text/x-patch 691.9 KB
