From: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Adjusting hash join memory limit to handle batch explosion
Date: 2025-02-25 16:30:17
Message-ID: CAJVSvF6290rJF2MtgSx_SuT9Kn2amZ_+zecoZYMU+dn3BVVaZg@mail.gmail.com
Lists: pgsql-hackers
On Wed, Feb 19, 2025 at 12:22 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> I've pushed the first (and main) part of the patch series, after some
> more cleanup and comment polishing.
Two comments on your merged patch --
First, it's easier to see what's going on if we overlook the logic to
round to nearest power of two, and solve the optimization problem
algebraically. Let T = the total memory needed to hash all input rows,
and B = the size of per-batch metadata (= 2 * BLCKSIZE, which is
typically 16 KB). Then, solving the optimization problem, the minimum
memory usage occurs at n = nbatches = SQRT(T / B) and w = workmem =
SQRT(B * T).
(Here I am using "workmem" for the hash table's "space_allowed.")
The total working memory used, at the minimum, is always 2 * w: twice
the optimal "workmem" ("space_allowed").
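To make the algebra easy to replay, here is a small standalone sketch
(plain C, not PostgreSQL code; the 64 GB input is just an example) of
the cost function being minimized:

#include <math.h>
#include <stdio.h>

/*
 * With n batches, the hash table needs workmem = T / n bytes (each
 * batch must fit in workmem), plus n * B bytes of per-batch metadata:
 *
 *     total(n) = T / n + n * B
 *
 * Setting the derivative -T/n^2 + B to zero gives n = SQRT(T / B),
 * hence workmem = T / n = SQRT(B * T) and total = 2 * SQRT(B * T).
 */
int
main(void)
{
    double      B = 2.0 * 8192;                 /* 2 * BLCKSIZE = 16 KB */
    double      T = 64.0 * 1024 * 1024 * 1024;  /* 64 GB to hash */
    double      best = sqrt(T / B);             /* 2048 batches */

    /* sanity check: batch counts on either side of the optimum cost more */
    for (double n = best / 2; n <= best * 2; n *= 2)
        printf("nbatches = %6.0f: total = %.0f MB\n",
               n, (T / n + n * B) / (1024 * 1024));
    return 0;
}

This prints 80 MB, 64 MB, 80 MB for 1024, 2048, and 4096 batches: the
minimum sits exactly where the hash table and the batch metadata each
take SQRT(B * T).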
This says that the maximum input size that can be (optimally) hashed
with the default 8 MB workmem (= work_mem * hash_mem_multiplier) is 4
GB, and the total working memory used would actually be 16 MB.
Also, to hash 64 GB, or 16x as much, requires a 32 MB workmem, with 64
MB of total working memory used. So "workmem" grows with the SQRT of
T, the total hash memory needed; and total working memory is 2x
"workmem."
Second -- the algebraic solution illustrates the difficulty in
tracking and restricting working memory usage for Hash Joins! Your
patch improves the "hash 64 GB" situation, because it eliminates 96 MB
of per-batch metadata, by reducing n = nbatches from 8192 to 2048, at
a cost of only 24 MB of workmem. Using the default 8 MB workmem,
*actual* total working memory used would be 8 MB + 16 KB * (64 GB / 8
MB) = 136 MB. By increasing workmem to 32 MB, total working memory is
only 64 MB; so we save 72 MB overall. This is a good thing, but--
The "but" is that the customer really should have set their workmem to
64 MB, in the first place; and we should have taken half of that for
the hash table, and left the other half for per-batch metadata.
-- OK, but historically we have pretended that the per-batch metadata
used no memory. So the customer should have set their workmem to 32
MB, with the understanding that PostgreSQL would have actually used 64
MB...
-- OK, but the customer *didn't* set their workmem to 32 MB. (If they
had, we wouldn't need this patch -- but we *do* need this patch, which
means the customer hasn't set their workmem high enough.) Why not?
Well, because if they set it to 32 MB, they'd run OOM!
-- So we are (secretly!) increasing the customer's workmem to 32 MB,
but only for this particular Hash Join. The customer can't increase it
to 32 MB for all Hash Joins, or they'd run OOM. So we increase it just
for this Hash Join, in the hopes that by doing so we'll avoid running
OOM... which is good; but we don't *tell* the customer we've done
this, and we just hope that the customer actually has 64 MB (= 2x
workmem) free (because, if they don't, they'll run OOM anyway).
All of this is to say that this patch illustrates the need for
something like proposal [1], which allows PostgreSQL to set workmem
limits on individual execution nodes, based on the optimizer's memory
estimates. In the above patch, we're blindly making things better,
without knowing whether we've made them good enough. (The customer is
less likely to run OOM using 64 MB instead of 136 MB, but OOM is still
possible since their workmem limit is 8 MB!)
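For completeness, the 136 MB vs. 64 MB comparison can be replayed the
same way (standalone arithmetic only, not the patch):

#include <stdio.h>

int
main(void)
{
    double      B = 2.0 * 8192;                 /* 2 * BLCKSIZE = 16 KB */
    double      mb = 1024.0 * 1024;
    double      T = 64 * 1024 * mb;             /* 64 GB to hash */
    double      workmems[] = {8 * mb, 32 * mb}; /* before / after */

    for (int i = 0; i < 2; i++)
    {
        double      w = workmems[i];
        double      nbatches = T / w;           /* each batch must fit in w */
        double      total = w + nbatches * B;   /* hash table + metadata */

        printf("workmem = %2.0f MB: nbatches = %4.0f, total = %3.0f MB\n",
               w / mb, nbatches, total / mb);
    }
    return 0;
}

which prints 8192 batches / 136 MB at the 8 MB limit, and 2048 batches /
64 MB at the 32 MB limit.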
In v.next of my patchset at [1] (should be done by end of day today) I
will deal with the case discussed above by:
1. Doubling Plan.workmem_limit whenever we halve nbatches (so we track
the "workmem" needed by the hash table);
2. Displaying Plan.workmem_limit + Hash.nbatches * (2 * BLCKSIZE)
inside EXPLAIN (work_mem on) (so we display to the customer our best
estimate of the effective workmem limit).
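As a toy, standalone illustration of (1) and (2) only -- this is not
the patch itself, and the variable names and the stopping condition
below are my own placeholders:

#include <stdio.h>

#define BLCKSIZE 8192           /* default 8 KB block size */

int
main(void)
{
    /* toy numbers: 64 GB of input, starting from an 8 MB limit */
    long long   input_bytes = 64LL * 1024 * 1024 * 1024;
    long long   workmem_limit = 8LL * 1024 * 1024;
    long long   nbatch = input_bytes / workmem_limit;   /* 8192 */

    /*
     * Point 1: while the per-batch buffers cost more than the hash
     * table itself, halve nbatch and double the hash table's limit.
     */
    while (nbatch > 1 && nbatch * (2 * BLCKSIZE) > workmem_limit)
    {
        nbatch /= 2;
        workmem_limit *= 2;
    }

    /*
     * Point 2: the value EXPLAIN (work_mem on) would show as the
     * effective limit: hash-table limit plus per-batch buffer overhead.
     */
    printf("workmem_limit = %lld MB, nbatch = %lld, effective = %lld MB\n",
           workmem_limit / (1024 * 1024), nbatch,
           (workmem_limit + nbatch * (2 * BLCKSIZE)) / (1024 * 1024));
    return 0;
}

Starting from the 64 GB / 8 MB example above, this stops at nbatch =
2048 with workmem_limit = 32 MB, and would report an effective limit of
64 MB, matching the algebraic optimum.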
Thanks,
James