From: Tomas Vondra <tomas(at)vondra(dot)me>
To: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Date: 2025-01-22 21:13:24
Message-ID: 2c614cba-4d0e-4a93-9590-35d689dd61ce@vondra.me
Lists: pgsql-hackers

On 1/10/25 19:00, James Hunter wrote:
> ...
>
> **Proposal:**
>
> I propose that we add a “query_work_mem” GUC, which works by
> distributing (using some algorithm to be described in a follow-up
> email) the entire “query_work_mem” to the query’s operators. And then
> each operator will spill when it exceeds its own work_mem limit. So
> we’ll preserve the existing “spill” logic as much as possible.
>
> To enable this to-be-described algorithm, I would add an “nbytes”
> field to the Path struct, and display this (and related info) in
> EXPLAIN PLAN. So the customer will be able to see how much work_mem
> the SQL compiler thinks they’ll need, per operator; and so will the
> algorithm.
>
> I wouldn’t change the existing planning logic (at least not in the
> initial implementation). So the existing planning logic would choose
> between different SQL operators, still on the assumption that every
> operator that needs working memory will get work_mem [*
> hash_mem_multiplier].
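
If I'm reading this right, the planner part of the proposal boils down
to something like this (just a sketch, only the new field shown, and the
type is my guess):

    typedef struct Path
    {
        /* ... all the existing Path fields stay as they are ... */

        /*
         * Planner's estimate of the working memory this operator would
         * like to have, to be shown in EXPLAIN and consumed by the
         * to-be-described distribution algorithm.
         */
        double      nbytes;
    } Path;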

All this seems generally feasible, but it hinges on whether the
to-be-described algorithm can distribute the memory in a sensible way
that doesn't affect the costing too much. If we plan a hash join with
nbatch=1, and then come back and make it use nbatch=1024, maybe we
wouldn't have used a hash join at all. Not sure.
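
To illustrate the concern with some hand-waving (this is not the actual
cost_hashjoin() arithmetic, just a back-of-envelope sketch): once we go
to nbatch > 1, most of both inputs gets written to temp files once and
read back once, so the chosen plan is missing an I/O term roughly like

    /* rough sketch only, not the real costing code */
    static Cost
    extra_batching_io(double inner_pages, double outer_pages, int nbatch)
    {
        if (nbatch <= 1)
            return 0;

        /* one write + one read of (most of) both inputs */
        return 2.0 * (inner_pages + outer_pages) * seq_page_cost;
    }

and if that term is large enough, a merge join or something else might
have won instead.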

The fundamental issue seems to be not having any information about how
much memory might be available to the operator. And in principle we
can't have that during the bottom-up part of planning, until after
we construct the whole plan. Only at that point do we know how many
operators will need work_mem.

Could we get at least some initial estimate of how many such operators
the query *might* end up using? Maybe that'd be just enough a priori
information to set an effective work_mem for the planning phase and
make this practical.
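
Something along these lines, perhaps (entirely hypothetical, the names
and the heuristics are made up):

    /*
     * Guess how many operators in the query are likely to want work_mem,
     * before any real planning happens, so we can derive an effective
     * work_mem to use for costing.
     */
    static int
    guess_workmem_operators(Query *parse)
    {
        int         count = 0;

        if (parse->sortClause != NIL)
            count++;                    /* final sort */
        if (parse->groupClause != NIL || parse->hasAggs)
            count++;                    /* grouping / aggregation */
        if (parse->distinctClause != NIL)
            count++;                    /* DISTINCT */

        /* pessimistically assume every join could end up as a hash join */
        if (list_length(parse->rtable) > 1)
            count += list_length(parse->rtable) - 1;

        return Max(count, 1);
    }

and then plan with an effective work_mem of query_work_mem /
guess_workmem_operators(parse). That'd be wrong in plenty of cases, of
course, but maybe wrong in a more useful way than assuming each operator
gets the full work_mem.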

> This assumption might understate or overstate
> the actual working memory we’ll give the operator, at runtime. If it
> understates, the planner will be biased in favor of operators that
> don’t use much working memory. If it overstates, the planner will be
> biased in favor of operators that use too much working memory.
>

I'm not quite sure I understand this part. Could you elaborate?

> (We could add a feedback loop to the planner, or even something simple
> like generating multiple paths, at different “work_mem” limits, but
> everything I can think of here adds complexity without much potential
> benefit. So I would defer any changes to the planner behavior until
> later, if ever.)
>

What would be the feedback? I can imagine improving the estimate of how
much memory a given operator needs during the bottom-up phase, but it
doesn't quite help with knowing what will happen above the current node.

> The to-be-described algorithm would look at a query’s Paths’ “nbytes”
> fields, as well as the session “work_mem” GUC (which would, now, serve
> as a hint to the SQL compiler), and decide how much of
> “query_work_mem” to assign to the corresponding Plan node.
>
> It would assign that limit to a new “work_mem” field, on the Plan
> node. And this limit would also be exposed, of course, in EXPLAIN
> ANALYZE, along with the actual work_mem usage, which might very well
> exceed the limit. This will let the customer know when a query spills,
> and why.
>
> I would write the algorithm to maintain the existing work_mem
> behavior, as much as possible. (Backward compatibility is good!) Most
> likely, it would treat “work_mem” (and “hash_mem_multiplier”) as a
> *minimum* work_mem. Then, so long as query_work_mem exceeds the sum of
> work_mem [* hash_mem_multiplier], for all operators in the query,
> all operators would be assigned at least work_mem, which would make my
> proposal a Pareto improvement.
>
> Last, at runtime, each PlanState would check its plan -> work_mem
> field, rather than the global work_mem GUC. Execution would otherwise
> be the same as today.
>
> What do you think?
>
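
FWIW the executor part seems mostly mechanical to me. I'd expect it to
end up as something like this (again just a sketch, and the GUC fallback
is my assumption):

    /*
     * Hypothetical helper: memory-hungry executor nodes would call this
     * instead of looking at the work_mem GUC directly.
     */
    static int
    exec_work_mem(PlanState *ps)
    {
        if (ps->plan->work_mem > 0)
            return ps->plan->work_mem;  /* per-node limit set by planner */

        return work_mem;                /* fall back to the GUC */
    }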

I find it a bit hard to discuss an abstract proposal without knowing
the really crucial ingredient. It might be helpful to implement some
sort of PoC of this approach; I'm sure that'd give us a lot of insight
and a way to experiment with it (instead of just speculating about what
might happen).

regards

--
Tomas Vondra
