From: | Tomas Vondra <tomas(at)vondra(dot)me> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Adjusting hash join memory limit to handle batch explosion |
Date: | 2024-12-31 23:06:59 |
Message-ID: | 7bed6c08-72a0-4ab9-a79c-e01fcdd0940f@vondra.me |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
I've been once again reminded of the batch explosion issue in hashjoin,
due to how it enforces the memory limit. This resurfaces every now and
then, when a used gets strange OOM issues - see for example these
threads from ~2019 for an example, and even some patches: [1] [2] [3]
Let me restart the discussion, resubmit some of the older patches, and
present a plan for what to do about this ...
Just to remind the basic details, a brief summary - the hashjoin does
not account for the spill files when enforcing the memory limit. The
hash table gets full, it decides to double the number of batches which
cuts the hash table size in half. But with enough batches the doubling
can actually make the situation much worse - the new batches simply use
more memory than was saved.
This can happen for various reasons. A simple example is that we under
estimate the size of the input relation, so the hash needs to be built
on many more tuples. This is bad, but usually not disastrous.
It's much worse when there's a batch that is not "divisible", i.e.
adding more batches does not split it roughly in half. This can happen
due to hash collisions (in the part that determines the batch),
duplicate values that didn't make it into MCV (and thus the skew
optimization does not kick in).
This is fairly rare, but when it happens it can easily lead to batch
explosion, i.e. rapidly increasing the number of batches. We add
batches, but the batch does not split, so we promptly hit the limit
again, triggering another increase. It often stops only when we exhaust
the 32-bit hash space, ending with 100s of thousands of batches.
Attached is a SQL script that reproduces something like this. It builds
a table with values with hashes that have 0s in the upper bits. And then
the hash join just spirals into a batch explosion.
Note: The script is a bit dumb and needs a lot of temp space (~50GB)
when generating the values with duplicate hashes.
In 2019 I shared a bunch of patches [4] improving this, but then I got
distracted and the discussion stalled because there were proposals to
fix this by introducing a special hash join "mode" to address these
issues [5], but we never got past a prototype and there's a lot of
outstanding questions.
So I decided to revisit the three patches from 2019. Attached are
rebased and cleaned up versions. A couple comments on each one:
1) v20241231-adjust-limit-0001-Account-for-batch-files-in-ha.patch
I believe this is the way to go, for now. The basic idea is to keep the
overall behavior, but "relax" the memory limit as the number of batches
increases to minimize the total memory use.
This may seem a bit weird, but as the number of batches grows there's no
way to not violate the limit. And the current code simply ignores this
and allocates arbitrary amounts of memory.
2) v20241231-single-spill-0001-Hash-join-with-a-single-spill.patch
The basic idea is that we keep only a small "slice" of batches in
memory, and data for later batches are spilled into a single file. This
means that even if the number of batches increases, the memory use does
not change. Which means the memory limit is enforced very strictly.
The problem is this performs *terribly* because it shuffles data many
times, always just to the next slice. So if we happen to have 128
batches in memory and the number explodes to ~128k batches, we end up
reading/writing each tuple ~500x.
3) v20241231-multi-spill-0001-Hash-join-with-a-multiple-spil.patch
This is an improvement of the "single spill", except that we keep one
spill file per slice, which greatly reduces the amount of temporary
traffic. The trouble is this means we can no longer enforce the memory
limit that strictly, because the number of files does grow with the
number of batches, although not 1:1. But with a slice of 128 batches we
get only 1 file per 128 batches, which is a nice reduction.
This means that ultimately it's either (1) or (3), and the more I've
been looking into this the more I prefer (1), for a couple reasons:
* It's much simpler (it doesn't really change anything on the basic
behavior, doesn't introduce any new files or anything like that.
* There doesn't seem to be major difference in total memory consumption
between the two approaches. Both allow allocating more memory.
* It actually helps with the "indivisible batch" case - it relaxes the
limit, so there's a chance the batch eventually fits and we stop adding
more and more batches. With spill files that's not the case - we still
keep the original limit, and we end up with the batch explosion (but
then we handle it much more efficiently).
Unless there are some objections, my plan is to get (1) cleaned up and
try to get it in for 18, possibly in the January CF. It's not a
particularly complex patch, and it already passes check-world (it only
affected three plans in join_hash, and those make sense I think).
One thing I'm not sure about yet is whether this needs to tweak the
hashjoin costing to also consider the files when deciding how many
batches to use. Maybe it should?
regards
[1]
https://www.postgresql.org/message-id/20190504003414.bulcbnge3rhwhcsh%40development
[2] https://www.postgresql.org/message-id/20230228190643.1e368315%40karst
[3]
https://www.postgresql.org/message-id/bc138e9f-c89e-9147-5395-61d51a757b3b%40gusw.net
[4]
https://www.postgresql.org/message-id/20190428141901.5dsbge2ka3rxmpk6%40development
--
Tomas Vondra
Attachment | Content-Type | Size |
---|---|---|
v20241231-adjust-limit-0001-Account-for-batch-files-in-ha.patch | text/x-patch | 12.0 KB |
v20241231-multi-spill-0001-Hash-join-with-a-multiple-spil.patch | text/x-patch | 27.1 KB |
v20241231-single-spill-0001-Hash-join-with-a-single-spill.patch | text/x-patch | 24.8 KB |
hash.sql | application/sql | 612 bytes |
From | Date | Subject | |
---|---|---|---|
Next Message | Thomas Munro | 2024-12-31 23:24:10 | Re: Strange issue with NFS mounted PGDATA on ugreen NAS |
Previous Message | Tom Lane | 2024-12-31 22:44:29 | Re: Strange issue with NFS mounted PGDATA on ugreen NAS |