From: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Melanie Plageman <melanieplageman(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Adjusting hash join memory limit to handle batch explosion
Date: 2025-02-25 18:02:01
Message-ID: CAJVSvF7oKewgJ+Wzj6SsxPpcWi6DxyUVDSBV3-Oi6dzMrSrBRQ@mail.gmail.com
Lists: pgsql-hackers
On Tue, Feb 25, 2025 at 9:39 AM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> On 2/25/25 17:30, James Hunter wrote:
> > On Wed, Feb 19, 2025 at 12:22 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
> > -- OK, but the customer *didn't* set their workmem to 32 MB. (If they
> > had, we wouldn't need this patch -- but we *do* need this patch, which
> > means the customer hasn't set their workmem high enough.) Why not?
> > Well, because if they set it to 32 MB, they'd run OOM!
> >
>
> Not sure I follow the reasoning here :-( If the query completed with a
> lower work_mem value, it should complete with work_mem = 32MB, because
> that reduces the amount of memory needed. But yes, it's possible they
> hit OOM in both cases; this is an attempt to reduce the impact.
Yes, your patch is a Pareto improvement, because it means we use less
working memory than we would otherwise.
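
To make sure I follow the arithmetic, here's a rough sketch of the tradeoff as I understand it -- the names and the 8 kB constant are mine, not the actual patch:

#include <stdbool.h>
#include <stddef.h>

/*
 * Illustrative sketch only. Each batch costs roughly two BufFile buffers
 * (inner + outer), about 8 kB each, so total hash-join memory is the hash
 * table limit plus that per-batch metadata. Once the metadata exceeds the
 * limit, doubling the limit (and thereby halving nbatch) shrinks the sum.
 */
#define PER_BATCH_BYTES (2 * 8192)   /* assumed inner + outer buffer cost */

size_t
total_hashjoin_bytes(size_t space_allowed, int nbatch)
{
    return space_allowed + (size_t) nbatch * PER_BATCH_BYTES;
}

bool
grow_limit_instead_of_nbatch(size_t space_allowed, int nbatch)
{
    /* doubling nbatch adds nbatch * PER_BATCH_BYTES; doubling the limit
     * adds space_allowed -- prefer whichever adds less total memory */
    return (size_t) nbatch * PER_BATCH_BYTES >= space_allowed;
}

If I have the arithmetic right, that's where the numbers below come from: with an 8 MB limit and an exploded nbatch the sum lands around 136 MB, while letting the limit float up to 32 MB balances the two terms at roughly 64 MB.
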
> > -- So we are (secretly!) increasing the customer's workmem to 32 MB,
> > but only for this particular Hash Join. The customer can't increase it
> > to 32 MB for all Hash Joins, or they'd run OOM. So we increase it just
> > for this Hash Join, in the hopes that by doing so we'll avoid running
> > OOM... which is good; but we don't *tell* the customer we've done
> > this, and we just hope that the customer actually has 64 MB (= 2x
> > workmem) free (because, if they don't, they'll run OOM anyway).
> >
>
> Right. This is meant to be a best-effort mitigation for rare cases.
>
> Maybe we should track/report it somehow, though. I mean, if 1% of hash
> joins need this, you're probably fine. If 99% of hash joins hit it, you
> probably really need a higher work_mem value because the hashed relation
> is just too large.
>
> But you have a point - maybe we should track/report this somewhere.
> The first step would be to make the total memory usage more visible in
> EXPLAIN (it's not obvious that it does not include the per-batch metadata).
Right -- my point is that mitigation is good, but tracking /
visibility is also necessary.
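
For instance, even something as small as carrying the planned and final limits through to the end of execution would do -- a purely hypothetical sketch, none of these fields exist today:

#include <stddef.h>

/*
 * Hypothetical instrumentation, not existing PostgreSQL code: record how
 * far the limit drifted at runtime so EXPLAIN ANALYZE (or a cumulative
 * stats view) could surface it, per-batch metadata included.
 */
typedef struct HashJoinMemReport
{
    size_t      planned_space_allowed; /* limit derived from work_mem at plan time */
    size_t      final_space_allowed;   /* limit after any runtime increase */
    int         final_nbatch;          /* batch count at end of execution */
    size_t      batch_metadata_bytes;  /* final_nbatch * per-batch buffer cost */
    size_t      peak_total_bytes;      /* hash table plus batch metadata, at peak */
} HashJoinMemReport;

Even just counting how often final_space_allowed ends up above planned_space_allowed would tell you whether a workload is in the 1% case or the 99% case you describe.
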
> > All of this is to say that this patch illustrates the need for
> > something like proposal [1], which allows PostgreSQL to set workmem
> > limits on individual execution nodes, based on the optimizer's memory
> > estimates. In the above patch, we're blindly making things better,
> > without knowing whether we've made them good enough. (The customer is
> > less likely to run OOM using 64 MB instead of 136 MB, but OOM is still
> > possible since their workmem limit is 8 MB!)
> >
> > In v.next of my patchset at [1] (should be done by end of day today) I
> > will deal with the case discussed above by:
> > ...
> I'm not opposed to doing something like this, but I'm not quite sure how
> it could help the cases I meant to address with my patch, where we plan
> with a low nbatch value, and then it explodes at execution time.
Your patch addresses two cases: (1) where we plan with a high nbatch
value; and (2) where we plan with a low nbatch value, and then it
explodes at execution time.
Case (2) can be solved only by taking some action at runtime, and that
action is always best effort (because if we really don't have enough
extra memory free, we have to run OOM). Once a query has started
running, we have fewer options.
Case (1) can be solved in various ways, because it occurs before we
start running the query. For example, we can:
1. Delay starting execution of the query until enough memory becomes
available; or
2. Take memory away from other execution nodes to give to this Hash Join.
But these case (1) solutions require access to the actual
working-memory calculation. That's all I'm saying -- by tracking our
"best-effort" decision, we make it possible to address case (1).
(Your patch solves case (2), as well as it can be solved, by giving
memory to this Hash Join at runtime, and hoping that no one else was
using it. It's hard to improve on that, because PG execution nodes
don't, in general, have the ability to give up memory after they've
started running. But we can do better for case (1), if other
components can basically see the results of your formula.)
Thanks,
James