Re: BUG #17757: Not honoring huge_pages setting during initdb causes DB crash in Kubernetes

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: david_sisson(at)dell(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org, PG Bug reporting form <noreply(at)postgresql(dot)org>
Subject: Re: BUG #17757: Not honoring huge_pages setting during initdb causes DB crash in Kubernetes
Date: 2023-01-22 00:27:04
Message-ID: 20230122002704.yoskrrfkbgi7xcfs@awork3.anarazel.de
Lists: pgsql-bugs

Hi,

On 2023-01-21 15:29:22 -0800, Andres Freund wrote:
> On 2023-01-22 00:10:29 +0100, Tomas Vondra wrote:
> > On 1/20/23 23:48, PG Bug reporting form wrote:
> > > In these cases, the initdb phase will attempt to allocate huge pages that
> > > are available in the OS, but it will be denied access by Kubernetes and
> > > fail.
> >
> > Well, so how exactly this fails? Does that mean Kubernetes broke mmap()
> > with MAP_HUGETLB so that it doesn't return MAP_FAILED when hugepages are
> > not available, or what? Because that's the only explanation I can see,
> > looking at the code.
>
> Yea, that's what I was wondering about as well.
>
>
> > Or it just does not realize there are no hugepages, returns something
> > and then crashes with SIGBUS later when trying to access it?
>
> I assume that that's the case. There's references to bus errors in a bunch of
> the linked issues. E.g.
> https://github.com/CrunchyData/postgres-operator/issues/413
>
> selecting default max_connections ... sh: line 1: 60 Bus error (core dumped) "/usr/pgsql-10/bin/postgres" --boot -x0 -F -c max_connections=100 -c shared_buffers=1000 -c dynamic_shared_memory_type=none < "/dev/null" > "/dev/null" 2>&1
>
> It's possible that the problem would go away if we used MAP_POPULATE for the
> allocation.

> I'd guess that this is annoying cgroups stuff :(

Ah, the fun:
https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/hugetlb.html

The HugeTLB controller allows users to limit the HugeTLB usage (page fault) per
control group and enforces the limit during page fault. Since HugeTLB
doesn't support page reclaim, enforcing the limit at page fault time implies
that, the application will get SIGBUS signal if it tries to fault in HugeTLB
pages beyond its limit. Therefore the application needs to know exactly how many
HugeTLB pages it uses before hand, and the sysadmin needs to make sure that
there are enough available on the machine for all the users to avoid processes
getting SIGBUS.

but there's also

Reservation accounting

hugetlb.<hugepagesize>.rsvd.limit_in_bytes
hugetlb.<hugepagesize>.rsvd.max_usage_in_bytes
hugetlb.<hugepagesize>.rsvd.usage_in_bytes
hugetlb.<hugepagesize>.rsvd.failcnt

The HugeTLB controller allows to limit the HugeTLB reservations per control
group and enforces the controller limit at reservation time and at the fault
of HugeTLB memory for which no reservation exists. Since reservation limits
are enforced at reservation time (on mmap or shget), reservation limits
never causes the application to get SIGBUS signal if the memory was reserved
before hand. For MAP_NORESERVE allocations, the reservation limit behaves
the same as the fault limit, enforcing memory usage at fault time and
causing the application to receive a SIGBUS if it’s crossing its limit.

Reservation limits are superior to page fault limits described above, since
reservation limits are enforced at reservation time (on mmap or shget), and
never causes the application to get SIGBUS signal if the memory was reserved
before hand. This allows for easier fallback to alternatives such as
non-HugeTLB memory for example. In the case of page fault accounting, it’s
very hard to avoid processes getting SIGBUS since the sysadmin needs
precisely know the HugeTLB usage of all the tasks in the system and make
sure there is enough pages to satisfy all requests. Avoiding tasks getting
SIGBUS on overcommited systems is practically impossible with page fault
accounting.
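
One way to see which kind of limit a container is running under is to read
the control files named in the kernel text. The following is only a minimal
sketch: the helper name read_cgroup_value is made up for illustration, and
the mount point and page-size component of any path passed to it (for
example /sys/fs/cgroup/hugetlb/hugetlb.2MB.rsvd.limit_in_bytes) are
assumptions that depend on how the cgroup v1 hierarchy is mounted on the
host.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical helper: read a single unsigned integer from a cgroup
 * control file such as hugetlb.<hugepagesize>.rsvd.limit_in_bytes.
 *
 * Returns 0 on success, -1 if the file cannot be opened or does not
 * start with a number.
 */
static int
read_cgroup_value(const char *path, unsigned long long *value)
{
    FILE *fp;
    int   ok;

    assert(path != NULL && value != NULL);

    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;              /* controller not mounted, or no such file */

    ok = (fscanf(fp, "%llu", value) == 1);
    fclose(fp);

    return ok ? 0 : -1;
}
```

Comparing the value read from the rsvd.limit_in_bytes file with the one
from the plain limit_in_bytes file for the same page size would show
whether a reservation limit or only a page fault limit has been set.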

So the problem is that the wrong type of cgroup limit is used. I don't know
if that's a Kubernetes or a postgres-operator issue.
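
To illustrate why the type of limit matters, here is a rough sketch of the
"try huge pages, fall back to regular pages" pattern discussed above. It is
not the PostgreSQL code; the function name map_shared_with_fallback is
invented for this example, and the caller is assumed to pass a size that is
a multiple of the huge page size. With a reservation limit, an over-limit
mmap() is expected to return MAP_FAILED, so the fallback branch runs. With
only a page fault limit, mmap() can succeed and the SIGBUS arrives later,
when the memory is first touched, which this code cannot detect.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for MAP_ANONYMOUS / MAP_HUGETLB on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: map `size` bytes of anonymous shared memory,
 * preferring huge pages and falling back to regular pages when the
 * huge page mapping is refused.
 *
 * *used_huge is set to 1 if the huge page mapping succeeded, else 0.
 * Returns MAP_FAILED only if the regular mapping fails as well.
 */
static void *
map_shared_with_fallback(size_t size, int *used_huge)
{
    void *ptr;

    assert(size > 0 && used_huge != NULL);

#ifdef MAP_HUGETLB
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (ptr != MAP_FAILED)
    {
        /* Reservation succeeded; faults may still fail under a fault limit. */
        *used_huge = 1;
        return ptr;
    }
#endif

    /* Huge pages unavailable or refused at reservation time: use regular pages. */
    *used_huge = 0;
    return mmap(NULL, size, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANONYMOUS, -1, 0);
}
```

The sketch deliberately leaves out MAP_POPULATE: according to the mmap(2)
man page, mmap() does not fail when the mapping cannot be populated, so
adding the flag would not by itself turn a fault-time failure into a
MAP_FAILED return.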

Greetings,

Andres Freund
