Re: Identify huge pages accessibility using madvise

From: Gabriele Bartolini <gabriele(dot)bartolini(at)enterprisedb(dot)com>
To: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Identify huge pages accessibility using madvise
Date: 2024-09-26 05:57:12
Message-ID: CA+VUV5pUzKp=hDnahV9Wfr12cJE6Cq_SpBZ=3b9AV=_BuwJN7g@mail.gmail.com
Lists: pgsql-hackers

Hi Dmitry,

I've been attempting to replicate this issue directly in Kubernetes, but I
haven't been successful so far. I've been using EKS nodes, and it seems
that they all run cgroup v2 now. Do you have anything that could help me
get started on this more quickly?

Thanks,
Gabriele

On Sat, 13 Apr 2024 at 18:24, Dmitry Dolgov <9erthalion6(at)gmail(dot)com> wrote:

> Hi,
>
> I would like to propose a small patch to address an annoying issue with
> the way PostgreSQL falls back when "huge_pages = try" is set. Here is
> what the problem looks like:
>
> * PostgreSQL is starting on a machine with some huge pages available
>
> * It tries to identify that fact and does mmap with MAP_HUGETLB, which
> succeeds
>
> * But it has the pleasure of running inside a cgroup with a hugetlb
> controller and limits set to 0 (or anything less than PostgreSQL
> needs)
>
> * Under these circumstances PostgreSQL will proceed to allocate huge
> pages, but the first page fault will trigger SIGBUS
>
> I've sketched out how to reproduce it with cgroup v1 and v2 in the
> attached scripts.
>
> This sounds like quite a rare combination of factors, but apparently
> it's fairly easy to run into this on K8s/OpenShift. There was a bug
> reported some time ago [1] about this behaviour, and back then I was
> under the impression it was a solved matter with nothing left to do. Yet
> I still observe this type of issue, the latest one no more than a week ago.
>
> After some research I found what looks to me like a relatively simple
> way to address the problem. In Linux kernel 5.14 a new flag to madvise
> was introduced that might be just what we need here. It's called
> MADV_POPULATE_READ [2] and it tells the kernel to populate page tables
> by triggering read faults if required. One by-design feature of this
> flag is to fail the madvise call in situations like the one above,
> giving an opportunity to avoid SIGBUS.
>
> I've outlined a patch to implement this approach and tested it on a
> newish Linux kernel I've got lying around (6.9.0-rc1) -- no SIGBUS,
> PostgreSQL falls back to not using huge pages. The resulting change
> seems to be small enough to justify addressing this small but annoying
> issue. Any thoughts or comments about the proposal?
>
> [1]:
> https://www.postgresql.org/message-id/flat/HE1PR0701MB256920EEAA3B2A9C06249F339E110%40HE1PR0701MB2569.eurprd07.prod.outlook.com
> [2]:
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4ca9b3859dac14bbef0c27d00667bb5b10917adb
>
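
The approach described in the quoted message can be sketched roughly as
follows. This is a minimal illustration, not the attached patch: the
function name alloc_shared is hypothetical, the fallback value 22 for
MADV_POPULATE_READ is taken from the asm-generic Linux headers for use
with older libc headers, and the caller is assumed to pass a size that
is a multiple of the huge page size.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stdbool.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumption: asm-generic value, for libc headers older than Linux 5.14. */
#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22
#endif

/*
 * Hypothetical helper: try a huge page mapping first, probe it with
 * MADV_POPULATE_READ, and fall back to regular pages if either step
 * fails. Returns NULL if no mapping could be created at all.
 */
void *
alloc_shared(size_t size, bool *used_huge)
{
    void *ptr;

    *used_huge = false;

#ifdef MAP_HUGETLB
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (ptr != MAP_FAILED)
    {
        /*
         * Ask the kernel to fault the pages in now. If the pages cannot
         * actually be provided, the call fails instead of a later memory
         * access raising SIGBUS.
         */
        if (madvise(ptr, size, MADV_POPULATE_READ) == 0)
        {
            *used_huge = true;
            return ptr;
        }

        /*
         * This sketch falls back on any failure, including EINVAL from
         * kernels that do not know the flag; a real patch may want to
         * treat that case differently.
         */
        munmap(ptr, size);
    }
#endif

    /* Fallback: the same mapping without huge pages. */
    ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
               MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return (ptr == MAP_FAILED) ? NULL : ptr;
}
```

The point of probing right after mmap is that the failure becomes an
ordinary error return at startup, where a fallback is still possible,
rather than a signal at the first access.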

--
Gabriele Bartolini
VP, Chief Architect, Kubernetes
enterprisedb.com
