From: | Joe Conway <mail(at)joeconway(dot)com> |
---|---|
To: | Ancoron Luciferis <ancoron(dot)luciferis(at)googlemail(dot)com>, pgsql-general(at)lists(dot)postgresql(dot)org |
Subject: | Re: Kubernetes, cgroups v2 and OOM killer - how to avoid? |
Date: | 2025-04-07 13:21:34 |
Message-ID: | d202ea2e-2dcb-4ba9-8d29-eeb4ed901c4f@joeconway.com |
Lists: | pgsql-general |
On 4/5/25 07:53, Ancoron Luciferis wrote:
> I've been investigating this topic every now and then but to this day
> have not come to a setup that consistently leads to a PostgreSQL backend
> process receiving an allocation error instead of being killed externally
> by the OOM killer.
>
> Why this is a problem for me? Because while applications are accessing
> their DBs (multiple services having their own DBs, some high-frequency),
> the whole server goes into recovery and kills all backends/connections.
>
> While my applications are written to tolerate that, it also means that
> at that time, esp. for the high-frequency apps, events are piling up,
> which then leads to a burst as soon as connectivity is restored. This in
> turn leads to peaks in resource usage in other places (event store,
> in-memory buffers from apps, ...), which sometimes leads to a series of
> OOM killer events being triggered, just because some analytics query
> went overboard.
>
> Ideally, I'd find a configuration that only terminates one backend but
> leaves the others working.
>
> I am wondering whether there is any way to receive a real ENOMEM inside
> a cgroup as soon as I try to allocate beyond its memory.max, instead of
> relying on the OOM killer.
>
> I know the recommendation is to have vm.overcommit_memory set to 2, but
> then that affects all workloads on the host, including critical infra
> like the kubelet, CNI, CSI, monitoring, ...
>
> I have already gone through and tested the obvious:
>
> https://www.postgresql.org/docs/current/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
Importantly, setting vm.overcommit_memory to 2 only matters when memory
is constrained at the host level.

As soon as you are running in a cgroup with a hard memory limit,
vm.overcommit_memory is irrelevant.

You can have terabytes of free memory on the host, but if cgroup memory
usage exceeds memory.limit_in_bytes (cgv1) or memory.max (cgv2) the OOM
killer will pick the process in the cgroup with the highest oom_score
and whack it.

Unfortunately there is no equivalent to vm.overcommit_memory within the
cgroup.
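
To see how close a cgroup is to that hard limit, and whether the OOM killer has already fired in it, something along these lines may help. It is only a sketch: it assumes cgroup v2 is mounted at /sys/fs/cgroup and that this directory is the container's own cgroup, and the helper names (parse_limit, parse_events, usage_ratio, report) are made up for illustration.

```python
from pathlib import Path
from typing import Dict, Optional

# Assumption: cgroup v2 mounted here, and inside the container this
# directory is the container's own cgroup.
CGROUP_DIR = Path("/sys/fs/cgroup")


def parse_limit(raw: str) -> Optional[int]:
    """Parse memory.max or memory.high content; "max" means no limit."""
    value = raw.strip()
    return None if value == "max" else int(value)


def parse_events(raw: str) -> Dict[str, int]:
    """Parse the flat "key value" lines of memory.events."""
    events: Dict[str, int] = {}
    for line in raw.splitlines():
        parts = line.split()
        if len(parts) == 2:
            events[parts[0]] = int(parts[1])
    return events


def usage_ratio(current: int, limit: Optional[int]) -> Optional[float]:
    """Fraction of the hard limit in use, or None when there is no limit."""
    if not limit:
        return None
    return current / limit


def report(cgroup_dir: Path = CGROUP_DIR) -> Optional[dict]:
    """Collect limit, usage and OOM kill count; None if files are missing."""
    names = ("memory.max", "memory.current", "memory.events")
    paths = {name: cgroup_dir / name for name in names}
    if not all(p.is_file() for p in paths.values()):
        return None  # e.g. cgroup v1 host, or the root cgroup
    limit = parse_limit(paths["memory.max"].read_text())
    current = int(paths["memory.current"].read_text())
    events = parse_events(paths["memory.events"].read_text())
    return {
        "limit": limit,
        "current": current,
        "ratio": usage_ratio(current, limit),
        "oom_kill": events.get("oom_kill", 0),
    }


if __name__ == "__main__":
    print(report())
```

A non-zero oom_kill count in memory.events means a process in that cgroup has already been killed by the OOM killer.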
> And yes, I know that Linux cgroups v2 memory.max is not an actual hard
> limit:
>
> https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-interface-files
Read that again -- memory.max *is* a hard limit (same as
memory.limit_in_bytes in cgv1).

"memory.max
A read-write single value file which exists on non-root cgroups. The
default is “max”.
Memory usage hard limit. This is the main mechanism to limit memory
usage of a cgroup. If a cgroup’s memory usage reaches this limit and
can’t be reduced, the OOM killer is invoked in the cgroup."

If you want a soft limit, use memory.high.
"memory.high
A read-write single value file which exists on non-root cgroups. The
default is “max”.
Memory usage throttle limit. If a cgroup’s usage goes over the high
boundary, the processes of the cgroup are throttled and put under
heavy reclaim pressure.
Going over the high limit never invokes the OOM killer and under
extreme conditions the limit may be breached. The high limit should
be used in scenarios where an external process monitors the limited
cgroup to alleviate heavy reclaim pressure."

You want to be using memory.high rather than memory.max.
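
As a rough illustration of picking a memory.high value, the sketch below derives one from an existing hard limit and writes it to the cgroup directory. The 90% fraction, the 4 KiB page size, and the function names are assumptions for the example, not recommendations from the kernel documentation. Writing the file needs suitable privileges, and how to get Kubernetes to apply such a value is not covered here.

```python
from pathlib import Path
from typing import Optional

PAGE_SIZE = 4096  # assumption: 4 KiB pages


def suggest_high(max_bytes: Optional[int], fraction: float = 0.9) -> Optional[int]:
    """Return a throttle limit below the hard limit, rounded down to a page.

    None stands for "max" (no limit) on both input and output.
    """
    if max_bytes is None:
        return None
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    high = int(max_bytes * fraction)
    return high - (high % PAGE_SIZE)


def write_high(cgroup_dir: Path, high_bytes: Optional[int]) -> None:
    """Write memory.high in the given cgroup directory ("max" for no limit)."""
    text = "max" if high_bytes is None else str(high_bytes)
    (cgroup_dir / "memory.high").write_text(text + "\n")


if __name__ == "__main__":
    one_gib = 1024 ** 3
    print("suggested memory.high for 1 GiB:", suggest_high(one_gib))
```

Keeping memory.high below any remaining hard limit gives throttling and reclaim a chance to kick in before the OOM killer would.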
Also, I don't know what Kubernetes recommends these days, but it used to
require you to disable swap. In more recent versions of Kubernetes you
are able to run with swap enabled, but I have no idea what the default
is -- make sure you run with swap enabled.

The combination of some swap being available and the throttling under
heavy reclaim will likely mitigate your problems.
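
A quick way to check whether swap is actually available to the workload is to look at SwapTotal in /proc/meminfo and at the cgroup's memory.swap.max. The following is a sketch under the same assumption about the cgroup v2 mount point as before; the function names are hypothetical.

```python
from pathlib import Path
from typing import Optional


def meminfo_kb(meminfo_text: str, key: str) -> Optional[int]:
    """Return the kB value for a /proc/meminfo key such as "SwapTotal"."""
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        if name.strip() == key and rest.split():
            return int(rest.split()[0])
    return None


def swap_allowed(swap_max_text: str) -> bool:
    """A memory.swap.max of "0" means the cgroup may not use swap at all."""
    return swap_max_text.strip() != "0"


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.is_file():
        print("SwapTotal (kB):", meminfo_kb(meminfo.read_text(), "SwapTotal"))
    # Assumption: cgroup v2 mounted at /sys/fs/cgroup; the file is absent
    # in the root cgroup or when swap accounting is unavailable.
    swap_max = Path("/sys/fs/cgroup/memory.swap.max")
    if swap_max.is_file():
        print("cgroup may swap:", swap_allowed(swap_max.read_text()))
```

A SwapTotal of 0 means the host has no swap configured, whatever the cgroup setting says.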
--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com