From: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Date: 2025-02-24 20:46:23
Message-ID: CAJVSvF5S89XNaH1Pg40DffXD3HEFKxEe8Rew3HviyymOYp5X+w@mail.gmail.com
Lists: pgsql-hackers
Attached please find the patch set I mentioned above, in [1]. It
consists of 4 patches that serve as building blocks for, and a
prototype of, the "query_work_mem" GUC I proposed:
* Patch 1 captures the optimizer's estimate of how much working memory
a particular Plan node would need to avoid spilling, and stores it on
the Plan, next to the cost fields. It also adds a new "work_mem on"
option to the EXPLAIN command, to display this working-memory
estimate. The estimate gives the customer a sense of how much working
memory a particular query will actually use, and also enables an
extension (e.g., Patch 4) to assign working-memory limits per exec
node, intelligently.
Patch 1 doesn't change any user-visible behavior, except for
displaying workmem estimates via EXPLAIN, when the new "work_mem on"
option is specified.
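As a purely illustrative picture of where the Patch 1 estimate lives, here is
a minimal standalone C sketch; the struct name and field layout are
assumptions for the example, not the patch's actual plannodes.h change (only
the "workmem" field name is taken from the discussion further down):

    /* Sketch only: the estimate sits next to the existing cost fields on
     * the plan node.  Layout here is illustrative, not the real struct. */
    typedef struct PlanCostSketch
    {
        double      startup_cost;   /* existing cost fields */
        double      total_cost;
        double      plan_rows;      /* existing row estimate */
        int         plan_width;     /* existing width estimate */
        double      workmem;        /* new in Patch 1: estimated working
                                     * memory needed to avoid spilling (KB) */
    } PlanCostSketch;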
* Patch 2 is a prerequisite for Patches 3 and 4. It maintains a
subPlan list on the Plan node, next to the existing initPlan list, to
store (pointers to) regular SubPlans.
The existing initPlan list is needed because otherwise there would be
no way to find those SubPlans at all; this new subPlan list hasn't
been needed before now, because every SubPlan on it already appears
somewhere inside the Plan node's expressions. The subPlan list is
needed now, however, because a SubPlan can use working memory (if it
maintains one or two hash tables). So we need a way to find such a
SubPlan to set its working-memory limit, and it doesn't make sense to
walk through all the Plan node's expressions a second time, after
we've finalized the plan.
Instead, Patch 2 copies regular SubPlans to this new list, inside
setrefs.c, so we can find them and assign working memory to them
later.
Patch 2 doesn't change any user-visible behavior -- it just adds some
internal bookkeeping.
* Patch 3 extends the new "work_mem on" EXPLAIN option further, to
show the working-memory limit. This is the limit already imposed by
PostgreSQL's work_mem and hash_mem_multiplier GUCs. Patch 3 copies
this limit from these GUCs onto a new field stored on the Plan
object. It then modifies "EXPLAIN (work_mem on)" to read this limit
off the Plan object and display it.
Patch 3 also modifies all existing exec nodes to read their
working-memory limit off their Plan, rather than off the GUC. It adds
a new function, ExecAssignWorkMem(), which gets called from
InitPlan(), immediately before we start calling ExecInitNode(). This
way, the executor could assign different working-memory limits, based
on global memory conditions; but this patch just preserves existing
behavior and copies these limits from the GUCs.
Other than this change to EXPLAIN, Patch 3 doesn't change any
user-visible behavior.
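For context, the limit Patch 3 copies onto the Plan is just the one the GUCs
already impose today: work_mem per data structure, scaled by
hash_mem_multiplier for hash-based structures. A hedged sketch of that
arithmetic (the function name and signature are mine, not the patch's):

    /* Illustrative only: the existing per-structure limit that Patch 3
     * copies onto each Plan node. */
    static double
    default_workmem_limit_kb(double work_mem_kb,
                             double hash_mem_multiplier,
                             int uses_hash_table)
    {
        if (uses_hash_table)
            return work_mem_kb * hash_mem_multiplier;
        return work_mem_kb;
    }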
* Patch 4, finally, adds a hook to allow extensions to override
ExecAssignWorkMem(). It also adds an extension, "workmem," that
implements this hook and assigns working memory to individual
execution nodes, based on a new workmem.query_work_mem GUC. This
extension prevents queries from exceeding workmem.query_work_mem,
while also handing out extra memory when the query's total limit,
from Patch 3, is less than workmem.query_work_mem.
In this way, Patch 4 avoids either undersubscribing or oversubscribing
working memory for queries, which is the goal of my proposal.
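To make the idea concrete, here is one way the arithmetic could look: rescale
every per-node limit by a common factor so the totals match the query budget.
This is just an illustration over a plain array, not the workmem extension's
actual code, which may use a different policy:

    /* Illustrative only: proportionally rescale per-node limits so their
     * sum equals the query-level budget.  Handles both oversubscription
     * (sum > budget) and undersubscription (sum < budget). */
    static void
    fit_limits_to_budget(double *limit_kb, int nnodes, double query_work_mem_kb)
    {
        double      total = 0.0;

        for (int i = 0; i < nnodes; i++)
            total += limit_kb[i];
        if (total <= 0.0)
            return;             /* nothing to scale */
        for (int i = 0; i < nnodes; i++)
            limit_kb[i] *= query_work_mem_kb / total;
    }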
Discussion follows--
A few operators currently do not honor their working-memory limits by
spilling; these operators use tuple hash tables — which don’t spill —
without implementing their own “spill” logic. I would address these
operators in a subsequent release. Note that Hash Agg and Hash Join
both spill, as expected, so the major cases already work.
I store the working-memory estimate on both Path and Plan objects.
In keeping with the PostgreSQL convention that a Path is an abstraction of
one or more Plan nodes, the Path’s working-memory estimate is “total,”
while the Plan’s is “per data structure.” So, if a SQL operator
requires 2 sort buffers, the Path’s working-memory estimate will be 2x
the Plan’s.
The Plan’s estimate is “per data structure,” because it will be used
to determine the data structure’s working-memory limit. Note that
every operator (save one) currently treats work_mem [*
hash_mem_multiplier] as a per-structure limit, rather than a
per-operator limit. (The exception is Hash Agg, where all of the
node’s hash tables share the same memory limit; and so I have
preserved this behavior in the Hash Agg’s workmem and workmem_limit
fields.)
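To make that convention concrete, a toy sketch (all names invented for the
example): an operator's total allowance is its per-structure limit times the
number of structures it keeps, except for Hash Agg, whose hash tables all
share a single limit.

    /* Sketch of the per-structure convention above (names assumed). */
    static double
    operator_total_allowance_kb(double per_structure_limit_kb,
                                int nstructures,
                                int is_hash_agg)
    {
        if (is_hash_agg)
            return per_structure_limit_kb;              /* shared by all tables */
        return per_structure_limit_kb * nstructures;    /* one limit per structure */
    }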
The Plan’s workmem estimate logically belongs on the Plan object (just
as the Path’s workmem logically belongs on the Path), while the
workmem_limit logically belongs on the PlanState. This is why
workmem_limit is set inside InitPlan() — it’s an execution-time limit,
not a plan-time limit.
However, the workmem_limit is stored, physically, on the Plan object,
not the PlanState. This is to avoid a chicken-and-egg problem: (a) The
PlanState is not created until ExecInitNode(); but, (b) ExecInitNode()
also typically creates the node’s data structures, sized to
workmem_limit.
So we need a way to set workmem_limit after the Plan has been
finalized, but before any exec nodes are initialized. Accordingly, we
set this field on the Plan object, with the understanding that it
doesn’t “really” belong there.
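A toy, plain-C sketch of that ordering constraint (every name here is invented
for the illustration; none of it is the patch's code): the limit has to be
written onto the plan-side struct before node initialization runs, because
initialization immediately sizes its data structures from it.

    #include <stdlib.h>

    typedef struct PlanSketch  { double workmem_limit_kb; } PlanSketch;
    typedef struct StateSketch { char  *buffer; } StateSketch;

    /* stands in for ExecInitNode(): sizes its data structure from the
     * plan-side limit, so the limit must already be in place */
    static StateSketch *
    init_node_sketch(const PlanSketch *plan)
    {
        StateSketch *state = calloc(1, sizeof(StateSketch));

        if (state != NULL)
            state->buffer = malloc((size_t) (plan->workmem_limit_kb * 1024.0));
        return state;
    }

    /* stands in for InitPlan(): assign the limit first, then init nodes */
    static StateSketch *
    init_plan_sketch(PlanSketch *plan, double limit_kb)
    {
        plan->workmem_limit_kb = limit_kb;
        return init_node_sketch(plan);
    }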
A nice consequence of storing workmem_limit on the Plan object, rather
than the PlanState, is that the limit automatically gets serialized and
deserialized to parallel workers. This simplifies Patch 3 a little bit,
since we can avoid executing ExecAssignWorkMem() on parallel workers;
but it really benefits Patch 4, because it allows the
ExecAssignWorkMem_hook to set a memory limit on the query, regardless
of the number of parallel workers that get spawned at runtime.
Notes about individual SQL operators follow--
Patch 1 reuses existing optimizer logic, as much as possible, to
calculate “workmem” — rounded up to the nearest KB, and with a minimum
of 64 KB. (The 64 KB minimum is because that’s the smallest a customer
can set the work_mem GUC, so it seems possible that some SQL operators
rely on the assumption that they’ll always get >= 64 KB of working
memory.)
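Stated as a self-contained helper (the function itself is just an illustration
of the rounding rule, not code from the patch):

    #include <math.h>

    /* Round a byte estimate up to whole KB, with a 64 KB floor, matching
     * the smallest value the work_mem GUC accepts. */
    static double
    normalize_workmem_kb(double nbytes)
    {
        double      kb = ceil(nbytes / 1024.0);

        return (kb < 64.0) ? 64.0 : kb;
    }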
The PostgreSQL operators that use working memory can be placed into
three categories:
1. Operators that use working memory, and which also cost the
possibility of spilling. For these operators, Patch 1 just reports the
“nbytes” estimate that the optimizer already produces.
1a. Sort and IncrementalSort (via cost_tuplesort()).
1b. HashJoin (via ExecChooseHashTableSize()).
1c. Material (via cost_material()).
1d. Unique (via either cost_sort() or cost_agg()).
1e. Grouping Sets (via create_groupingsets_path(), which calls
cost_sort() and cost_agg()).
NOTE: Grouping Sets can end up creating a chain of Agg plan nodes,
each of which gets its own working-memory budget. Discussed below.
1f. Agg (via cost_agg()).
NOTE: Discussed below.
1g. SetOp (via create_setop_path()).
NOTE: A SetOp never spills; however, existing logic disables the SetOp
“if it doesn't look like the hashtable will fit into hash_mem.” It
assumes the hash entry size is MAXALIGN(leftpath->pathtarget->width)
+ MAXALIGN(SizeofMinimalTupleHeader); see the sketch at the end of
these notes.
2. Operators that use working memory, but which do not currently cost
the possibility of spilling, because the existing estimate is assumed
to be unreliable. For these operators, Patch 1 just reports an
“unreliable” estimate.
2a. FunctionScan.
2b. TableFuncScan.
3. Remaining operators that use working memory, but for whatever
reason do not currently cost the possibility of spilling. For these
operators, Patch 1 just computes and reports an estimate, based on
logic appearing elsewhere in the code.
3a. RecursiveUnion. (Uses two Tuplestores, and possibly a
TupleHashTable.) Patch 1 uses nrterm to estimate one of the
Tuplestores; rterm to estimate the second Tuplestore; and (if
relevant) numGroups to estimate # of hash buckets.
3b. CteScan (but *not* WorkTableScan). (Relies on cost_ctescan().)
Patch 1 just uses rows * width, since the output is materialized into
a Tuplestore.
3c. Memoize. Patch 1 uses ndistinct to estimate # of hash buckets.
3d. WindowAgg. Patch 1 uses startup_tuples to estimate # of tuples
materialized in the Tuplestore.
3e. BitmapHeapScan. Although the TID bitmaps created by the
bitmapqual’s BitmapIndexScan nodes are limited to work_mem, these
bitmaps lossify rather than spill. Patch 1 applies the inverse of
tbm_calculate_entries() to the expected number of heap rows, produced
by the optimizer.
3f. SubPlan, if it requires a hash table (and possibly a hash-NULL
table). Patch 1 uses rows and rows / 16, respectively, copying the
existing logic in nodeSubplan.c and subselect.c.
NOTE: Since we don’t display SubPlans directly, in EXPLAIN, Patch 1
includes this working-memory estimate along with the SubPlan’s parent
Plan node.
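Returning to the SetOp note in 1g above, here is a standalone restatement of
that entry-size arithmetic; the local MAXALIGN stand-in assumes 8-byte
alignment, which is an assumption of this sketch rather than something taken
from the patch:

    #include <stddef.h>

    /* Local stand-in for PostgreSQL's MAXALIGN, assuming 8-byte alignment. */
    #define MAXALIGN_SKETCH(len)   ((((size_t) (len)) + 7) & ~((size_t) 7))

    /* Per-entry size used to decide whether the SetOp hash table fits in
     * hash_mem: aligned tuple width plus aligned minimal-tuple header. */
    static size_t
    setop_hash_entry_size(size_t pathtarget_width, size_t minimal_tuple_header)
    {
        return MAXALIGN_SKETCH(pathtarget_width) +
               MAXALIGN_SKETCH(minimal_tuple_header);
    }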
Final comments --
I think the attached patch-set is useful, by itself; but it also
serves as a necessary building block for future work to manage query
working-memory dynamically. For example, the optimizer could be
enhanced to trade off between a high-memory + low cost plan, and a
low-memory + high cost plan. The execution-time extension could be
enhanced to adjust its query-working-memory limit based on current,
global memory usage.
And individual exec nodes could be enhanced to request additional
working-memory, via hook, if they discover they need to spill
unexpectedly. (For example, this would be useful for serial Hash
Joins.)
Questions / comments / suggestions / issues / complaints?
Thanks,
James Hunter
Attachments:
v01_0001-EXPLAIN-now-takes-work_mem-option-to-display-estimat.patch (application/octet-stream, 93.5 KB)
v01_0002-Store-non-init-plan-SubPlan-objects-in-Plan-list.patch (application/octet-stream, 36.2 KB)
v01_0003-EXPLAIN-WORK_MEM-ON-now-shows-working-memory-limit.patch (application/octet-stream, 68.4 KB)
v01_0004-Add-workmem_hook-to-allow-extensions-to-override-per.patch (application/octet-stream, 54.3 KB)