Re: Changing shared_buffers without restart

From: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>
To: Matthias van de Meent <boekewurm+postgres(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Changing shared_buffers without restart
Date: 2024-12-02 19:17:59
Message-ID: lnilchj4anxlfog3vgyeztmbiapfv2grpyh6pbxfl5pzg6nefb@bzbjxqzm2imf
Lists: pgsql-hackers

> On Fri, Nov 29, 2024 at 05:47:27PM GMT, Dmitry Dolgov wrote:
> > On Fri, Nov 29, 2024 at 01:56:30AM GMT, Matthias van de Meent wrote:
> >
> > I mean, we can do the following to get a nice contiguous empty address
> > space no other mmap(NULL)s will get put into:
> >
> > /* reserve size bytes of memory */
> > base = mmap(NULL, size, PROT_NONE, ...flags, ...);
> > /* use the first small_size bytes of that reservation */
> > allocated_in_reserved = mmap(base, small_size, PROT_READ |
> > PROT_WRITE, MAP_FIXED, ...);
> >
> > With the PROT_NONE protection option the OS doesn't actually allocate
> > any backing memory, but guarantees no other mmap(NULL, ...) will get
> > placed in that area such that it overlaps with that allocation until
> > the area is munmap-ed, thus allowing us to reserve a chunk of address
> > space without actually using (much) memory.
>
> From what I understand it's not much different from the scenario when we
> just map as much as we want in advance. The actual memory will not be
> allocated in either case due to CoW, and oom_score seems to be the
> same. I agree it sounds attractive, but after some experimenting it
> looks like it won't work with huge pages inside a cgroup v2
> (=container).
>
> The reason is Linux has recently learned to apply memory reservation
> limits on hugetlb inside a cgroup, which are applied to mmap. Nowadays
> this feature is often configured out of the box in various container
> orchestrators, meaning that a scenario "set hugetlb=1GB on a container,
> reserve 32GB with PROT_NONE" will fail. I've also tried to mix and
> match, reserve some address space via non-hugetlb mapping, and allocate
> a hugetlb out of it, but it doesn't work either (the smaller mmap
> complains about MAP_HUGETLB with EINVAL).

I've asked about that in linux-mm [1]. To my surprise, the
recommendations were to stick to creating a large mapping in advance,
and slice smaller mappings out of that, which could be resized later.
The OOM score should not be affected, and the hugetlb reservation limit
could be avoided by using the MAP_NORESERVE flag for the initial mapping
(I've experimented with that, and it seems to work just fine, even if
the slices are not using MAP_NORESERVE).
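
For illustration, a minimal sketch of that approach as I understand it,
not the actual patch. The helper names reserve_address_space() and
map_slice() are made up for this example, and MAP_HUGETLB is left out so
that it runs on a machine without configured huge pages; for the hugetlb
case the assumption is that the flag would be added to both calls.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: reserve reserve_size bytes of contiguous address
 * space up front. PROT_NONE makes the range inaccessible, and
 * MAP_NORESERVE asks the kernel not to reserve backing store for it.
 * Returns MAP_FAILED on error.
 */
static void *
reserve_address_space(size_t reserve_size)
{
    return mmap(NULL, reserve_size, PROT_NONE,
                MAP_SHARED | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
}

/*
 * Hypothetical helper: carve a usable slice out of the start of the
 * reservation. MAP_FIXED replaces the reserved pages in
 * [base, base + slice_size) with a readable and writable mapping.
 * Note that the slice itself does not pass MAP_NORESERVE.
 * Returns MAP_FAILED on error.
 */
static void *
map_slice(void *base, size_t slice_size)
{
    assert(base != MAP_FAILED);
    assert(slice_size > 0);

    return mmap(base, slice_size, PROT_READ | PROT_WRITE,
                MAP_SHARED | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
}
```

The caller reserves the maximum size once, maps a slice of the current
size at the base address, and releases everything with a single munmap()
over the whole reservation. How the slice would be resized later is not
covered here.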

I guess that means I'll try to experiment with this approach as well.
But what do others think? How much research do we need to do to gain
some confidence about large shared mappings and make it realistically
acceptable?

[1]: https://lore.kernel.org/linux-mm/pr7zggtdgjqjwyrfqzusih2suofszxvlfxdptbo2smneixkp7i(at)nrmtbhemy3is/t/
