
From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Date: 2025-01-22 20:48:51
Message-ID: f3fd3e84-b66d-4f47-992a-86a625ffb000@vondra.me
Lists: pgsql-hackers

On 1/21/25 22:26, Jeff Davis wrote:
> On Fri, 2025-01-10 at 10:00 -0800, James Hunter wrote:
>> How should “query_work_mem” work? Let’s start with an example:
>> suppose
>> we have an OLAP query that has 2 Hash Joins, and no other operators
>> that use work_mem.
>
> So we plan first, and then assign available memory afterward? If we do
> it that way, then the costing will be inaccurate, because the original
> costs are based on the original work_mem.
>
> It may be better than killing the query, but not ideal.
>
>> -- Second, we could say, instead: the small Hash Join is *highly
>> unlikely* to use > 1 MB, so let’s just give both Hash Joins 1023 MB,
>> expecting that the small Hash Join won’t use more than 1 MB of its
>> 1023 MB allotment anyway, so we won’t run OOM. In effect, we’re
>> oversubscribing, betting that the small Hash Join will just stay
>> within some smaller, “unenforced” memory limit.
>>
>> In this example, this bet is probably fine — but it won’t work in
>> general. I don’t want to be in the business of gambling with customer
>> resources: if the small Hash Join is unlikely to use more than 1 MB,
>> then let’s just assign it 1 MB of work_mem.
>
> I like this idea. Operators that either know they don't need much
> memory, or estimate that they don't need much memory, can constrain
> themselves. That would protect against misestimations and advertise to
> the higher levels of the planner how much memory the operator actually
> wants. Right now, the planner doesn't know which operators need a lot
> of memory and which ones don't need any significant amount at all.
>

I'm not sure I like the idea that much.

At first glance, restricting the operator to the amount the optimizer
predicts it will need seems reasonable, because that's generally the best
estimate of memory usage we have without actually running the query.

But these estimates are often fundamentally unreliable - maybe not for
simple examples, but once you put an aggregate on top of a join, the
errors can be pretty wild. Allowing the operator to still use more
work_mem makes this more adaptive. I suspect forcing the operator to
adhere to the estimated work_mem might make things much worse (though I
haven't tried it - maybe spilling to temp files is not that bad).
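
To show the kind of gap I mean, something like this makes it easy to
compare the planner's estimates with reality (the tables and columns
here are made up, just to illustrate the pattern):

-- Hypothetical example: compare estimated vs. actual row counts for an
-- aggregate on top of a join. Large gaps on the Hash Join / aggregate
-- nodes are exactly the cases where a memory limit derived from the
-- estimate would misfire.
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.customer_id, count(*)
FROM orders o
JOIN order_lines l ON l.order_id = o.order_id
WHERE o.created_at >= now() - interval '30 days'
GROUP BY o.customer_id;
-- The memory the aggregate actually needs scales with the actual number
-- of groups, not with the estimated one.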

> The challenge, of course, is what the higher levels of the planner
> would do with that information, which goes to the rest of your
> proposal. But tracking the information seems very reasonable to me.
>

I agree. Tracking additional information seems like a good idea, but
it's not clear to me how the planner would use it. I can imagine
various approaches - e.g. we might do the planning as usual and then
distribute the query_work_mem between the nodes in proportion to the
estimated amount of memory. But that all seems like very ad hoc
heuristics, easy to confuse / likely to make the wrong decision.
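
Just to make the "in proportion to the estimates" idea concrete, the
arithmetic would be something like this (the per-node estimates and the
1GB of query_work_mem are made-up numbers, not from any patch):

-- Each node gets query_work_mem * (its estimated memory / total estimate).
WITH nodes(node, est_kb) AS (
    VALUES ('Hash Join (big)',   900000),
           ('Hash Join (small)',   1000),
           ('Sort',              100000)
)
SELECT node, est_kb,
       round(1048576 * est_kb::numeric / sum(est_kb) OVER ()) AS assigned_kb
FROM nodes;
-- 1048576 kB = 1GB of query_work_mem. A node with a badly wrong estimate
-- still gets a share proportional to that wrong estimate, which is the
-- "easy to confuse" part.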

>> I propose that we add a “query_work_mem” GUC, which works by
>> distributing (using some algorithm to be described in a follow-up
>> email) the entire “query_work_mem” to the query’s operators. And then
>> each operator will spill when it exceeds its own work_mem limit. So
>> we’ll preserve the existing “spill” logic as much as possible.
>
> The description above sounds too "top-down" to me. That may work, but
> has the disadvantage that costing has already happened. We should also
> consider:
>
> * Reusing the path generation infrastructure so that both "high memory"
> and "low memory" paths can be considered, and if a path requires too
> much memory in aggregate, then it would be rejected in favor of a path
> that uses less memory. This feels like it fits within the planner
> architecture the best, but it also might lead to a path explosion, so
> we may need additional controls.
>
> * Some kind of negotiation where the top level of the planner finds
> that the plan uses too much memory, and replans some or all of it. (I
> think is similar to what you described as the "feedback loop" later in
> your email.) I agree that this is complex and may not have enough
> benefit to justify.
>

Right, it seems rather at odds with the bottom-up construction of paths.
The amount of memory an operator may use seems like a pretty fundamental
piece of information, but if it's available only after the whole plan is
built, that seems ... not great.

I don't know if generating (and keeping) low/high-memory paths is quite
feasible. Isn't that really a continuum for many paths? A hash join may
need very little memory (with batching) or a lot of memory (if keeping
everything in memory), so how would this work? Would we generate paths
for a range of work_mem values (with different costs)?
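
For illustration (made-up table names): the same hash join already gets
a different cost - and, at execution time, a different number of
batches - just by varying work_mem, which is the continuum I mean:

SET work_mem = '1MB';
EXPLAIN SELECT * FROM fact f JOIN dim d ON d.id = f.dim_id;

SET work_mem = '256MB';
EXPLAIN SELECT * FROM fact f JOIN dim d ON d.id = f.dim_id;
-- The estimated cost of the Hash Join differs between the two runs,
-- because the planner charges extra I/O for batching when the inner
-- side does not fit in work_mem. "Paths for a range of work_mem values"
-- would mean keeping several such variants around at once.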

regards

--
Tomas Vondra
