From: James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com>
To: Jeff Davis <pgsql(at)j-davis(dot)com>, "pg(at)bowt(dot)ie" <pg(at)bowt(dot)ie>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Proposal: "query_work_mem" GUC, to distribute working memory to the query's individual operators
Date: 2025-02-11 03:12:50
Message-ID: CAJVSvF54WHx=zBj7Q-7XW5-qHP7L+PoZ8cZvuOPKic=rZTWYCQ@mail.gmail.com
Lists: pgsql-hackers

I hope to have an initial patch-set for a prototype within the next
couple of weeks. But I wanted to add some design comments to this
thread first, to solicit feedback.

First, some bookkeeping: Peter Geoghegan pointed me, offline, to
Oracle’s 2002 paper [1] on how they managed SQL execution memory in
9i. I found it helpful to compare my proposal to what Oracle did. The
main difference I see is that Oracle modified their SQL operators to
“give back” memory, at runtime, when the resource manager reduces the
per-operator memory limit. Doing this causes its own problems, but it
allows Oracle to maintain a single “per-operator” memory limit that
applies to *all* operators; see Figure 6.

I am reluctant to make a similar change to PostgreSQL, because (1) it
would involve a lot of code churn, and (2) it’s not clear to me that
this is a good path to take. Note that the Oracle design allows total
memory to exceed the global memory limit, temporarily, while the
system waits for running operators to give their memory back. So, the
paper describes how Oracle tries to anticipate this situation and
reduce the per-operator memory limit in advance... but I have not had
a good experience with that sort of strategy, in the cloud.

The Oracle design necessarily overprovisions some operators, because
it assigns the same limit to all operators. (See, again, Figure 6,
which makes all of this clearer than anything I could write.) It
relies on detecting when an overprovisioned operator starts to use
more of the memory it was provisioned... and then quickly reducing the
per-operator limit, so that other operators give up their memory for
use by the previously-overprovisioned operator. In this way, the
Oracle design is very fair.
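As a rough sketch of the arithmetic behind that design (my paraphrase
of the paper's Figure 6, not actual Oracle or PostgreSQL code), the
single bound is just the global limit divided among whatever operators
are currently active, so it shrinks whenever a new operator starts:

```c
#include <assert.h>

/*
 * Illustrative only: one shared bound for all active operators, as in
 * the Oracle 9i scheme.  When active_operators grows, the bound drops,
 * and already-running operators must "give back" memory to get under
 * the new bound -- the runtime behavior PostgreSQL operators lack.
 */
double
per_operator_bound(double global_limit_kb, int active_operators)
{
    if (active_operators <= 0)
        return global_limit_kb;
    return global_limit_kb / active_operators;
}
```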

However, while waiting for the other operators to give up their memory
(since they are now oversubscribed), the system temporarily exceeds
the global memory limit. This opens up a can of worms, but it seems
like the Oracle paper deals with this situation by letting the
excessive memory swap to disk (see Figures 10 and 11).

I don’t want to modify PostgreSQL operators so they can give up memory
at runtime. So this forces my solution to do two things: (1) provide
different operators different memory limits, since I can’t take memory
away from an operator after it has started running; and (2) give each
operator (at least) an initial memory reservation, before it starts
running. Hence, the approach I described earlier in this thread.
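To make points (1) and (2) concrete, here is a minimal sketch of the
kind of distribution I have in mind (all names and the proportional
policy are illustrative assumptions, not the actual patch): divide the
query-wide budget among operators in proportion to the planner's
estimates, while guaranteeing each operator a minimum initial
reservation before it starts:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimum initial reservation per operator, in KB. */
#define MIN_RESERVATION_KB 64.0

/*
 * Illustrative sketch: split budget_kb among nops operators in
 * proportion to the planner's estimated need (est_kb[]), but never
 * give an operator less than MIN_RESERVATION_KB.  (A real
 * implementation would also need a policy for when the minimums alone
 * exceed the budget.)
 */
static void
distribute_query_work_mem(const double *est_kb, size_t nops,
                          double budget_kb, double *limit_kb)
{
    double total_est = 0.0;

    for (size_t i = 0; i < nops; i++)
        total_est += est_kb[i];

    for (size_t i = 0; i < nops; i++)
    {
        double share = (total_est > 0.0)
            ? budget_kb * (est_kb[i] / total_est)
            : budget_kb / (double) nops;

        /* every operator gets at least its initial reservation */
        limit_kb[i] = (share > MIN_RESERVATION_KB)
            ? share
            : MIN_RESERVATION_KB;
    }
}
```

The key property is that each operator's limit is fixed before it
starts running, so no operator is ever asked to give memory back.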

Second, some motivation: the cloud makes the resource management
problem worse than it is on-premise. I would refer to page 2 of the
Oracle doc (too long to quote here), as justification for moving away
from the “work_mem” GUC, but note that these arguments apply more
strongly to cloud databases, for two reasons. First, swap can be
prohibitively expensive in the cloud, and spilling to disk is also
very expensive, because cloud instances frequently lack attached,
ephemeral storage. Cloud remote storage can be extremely slow [2]:
“For example, a gp2 volume under 1,000 GiB with burst credits
available has ... a volume throughput limit of 250 MiB/s.”
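To put that throughput figure in perspective, a back-of-envelope
helper (the function and the 10 GiB workload are my own illustrative
assumptions): spilling a 10 GiB hash table and then reading it back at
250 MiB/s costs roughly 80 seconds of pure I/O.

```c
#include <assert.h>

/*
 * Rough, illustrative I/O cost of spilling: write the spill file out,
 * then read it back, both at the volume's throughput limit.
 */
double
spill_seconds(double spill_gib, double throughput_mib_s)
{
    double mib = spill_gib * 1024.0;

    return 2.0 * mib / throughput_mib_s;    /* write + read back */
}
```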

Second, any cloud provider has an effectively infinite number of
customer instances. I mean that this number is large enough that the
cloud provider cannot afford to manage these instances except via
automated tools. So, when the Oracle paper says, “Generally, the DBA
tries to avoid over-allocation by assuming the worst workload in order
to avoid paging (with dramatic degradation in performance) or query
failure,” note that the degradation is more dramatic in the cloud, and
the cost of under-utilization is higher.

Also: “In most commercial systems the burden has been put on the
DBA to provide an optimal setting for configuration parameters that
are internally used to decide how much memory to allocate to a given
database operator. This is a challenging task for the DBA...” This is
an impossible task for the cloud provider!

Thanks,
James

[1] https://www.vldb.org/conf/2002/S29P03.pdf
[2] https://docs.aws.amazon.com/ebs/latest/userguide/ebs-io-characteristics.html#ebs-io-size-throughput-limits

On Mon, Feb 10, 2025 at 7:09 PM James Hunter <james(dot)hunter(dot)pg(at)gmail(dot)com> wrote:
>
> On Fri, Jan 24, 2025 at 5:48 PM Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> >
> > On Fri, 2025-01-24 at 17:04 -0800, James Hunter wrote:
> > > Generating "high memory" vs. "low memory" paths would be tricky,
> > > because the definition of "high" vs. "low" depends on the entire path
> > > tree, not just on a single path node. So I think it would quickly
> > > lead
> > > to a state-space explosion, as you mention.
> >
> > At first, it appears to lead to an explosion, but there are a lot of
> > ways to prune early. ...
> >
> > Maybe my idea doesn't work out, but I think it's too early to dismiss
> > it.
>
> I think it makes sense to split the work into two parts: one part that
> improves SQL execution, and a second part that improves the optimizer,
> to reflect the improvements to execution.
>
> It seems better to me to wait until we have the ability to enforce
> memory limits, before worrying about ways to generate different paths
> with different memory limits. Then we would be able to tune the
> optimizer heuristics based on the actual executor, instead of
> extrapolating how the executor would behave under different memory
> limits.
>
> James
