Re: Changing shared_buffers without restart

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Peter Eisentraut <peter(at)eisentraut(dot)org>
Cc: Dmitry Dolgov <9erthalion6(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Changing shared_buffers without restart
Date: 2025-04-17 15:54:31
Message-ID: CA+hUKGJ-RfwSe3=ZS2HRV9rvgrZTJJButfE8Kh5C6Ta2Eb+mPQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Nov 21, 2024 at 8:55 PM Peter Eisentraut <peter(at)eisentraut(dot)org> wrote:
> On 19.11.24 14:29, Dmitry Dolgov wrote:
> >> I see that memfd_create() has a MFD_HUGETLB flag. It's not very clear how
> >> that interacts with the MAP_HUGETLB flag for mmap(). Do you need to specify
> >> both of them if you want huge pages?
> > Correct, both (one flag in memfd_create and one for mmap) are needed to
> > use huge pages.
>
> I was worried because the FreeBSD man page says
>
> MFD_HUGETLB This flag is currently unsupported.
>
> It looks like FreeBSD doesn't have MAP_HUGETLB, so maybe this is irrelevant.
>
> But you should make sure in your patch that the right set of flags for
> huge pages is passed.

MFD_HUGETLB does actually work on FreeBSD, but the man page doesn't
admit it (guessing an oversight, not sure, will see). And you don't
need the corresponding (non-existent) mmap flag. You also have to
specify a size, e.g. MFD_HUGETLB | MFD_HUGE_2MB, or you get ENOTSUPP,
but other than that quirk I see it definitely working with e.g.
procstat -v.
That might be because FreeBSD doesn't have a default huge page size
concept? On Linux that's a boot time setting, I guess rarely changed.
I contemplated that once before, when I wrote a quick demo patch[1] to
implement huge_pages=on for FreeBSD (i.e. explicit rather than
transparent). I used a different function, not the Linuxoid one but
it's the same under the covers, and I wrote:

+ /*
+ * Find the matching page size index, or if huge_page_size wasn't set,
+ * then skip the smallest size and take the next one after that.
+ */
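
A minimal sketch of the selection rule described in that comment, not the
code from the demo patch. The helper name choose_huge_page_index() is
hypothetical, and it assumes the caller already has the supported page
sizes in ascending order with the base page size first (for example as
filled in by getpagesizes(3) on FreeBSD). Passing 0 for "requested" stands
in for huge_page_size being unset. Refusing to match the base page size is
also an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * sizes[] holds the supported page sizes in ascending order, so sizes[0]
 * is the ordinary base page size.  requested == 0 means "huge_page_size
 * wasn't set".
 *
 * Returns the index of the page size to use, or -1 if nothing suitable
 * exists.
 */
static int
choose_huge_page_index(const size_t *sizes, int nsizes, size_t requested)
{
    assert(sizes != NULL && nsizes >= 0);

    /* No explicit size: skip the smallest and take the next one. */
    if (requested == 0)
        return nsizes > 1 ? 1 : -1;

    /* Explicit size: it must match a size larger than the base page. */
    for (int i = 1; i < nsizes; i++)
    {
        if (sizes[i] == requested)
            return i;
    }
    return -1;
}
```

With sizes {4096, 2097152, 1073741824}, an unset huge_page_size picks
index 1 (2MB), an explicit 1GB request picks index 2, and a request for a
size the system does not offer is rejected.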

Swapping that topic back in, I was left wondering: (1) how to choose
between SHM_LARGEPAGE_ALLOC_DEFAULT, a policy that will cause
ftruncate() to try to defragment physical memory to fulfil your
request and can eat some serious CPU, and SHM_LARGEPAGE_ALLOC_NOWAIT,
and (2) if it's the second thing, well Linux is like that in respect
of failing fast, but for it to succeed you have to configure
nr_hugepages in the OS as a separate administrative step and *that's*
when it does any defragmentation required, and that's another concept
FreeBSD doesn't have. It's a bit of a weird concept too, I mean those
pages are not reserved for you in any way and anyone could nab them,
which is undeniably practical but it lacks a few qualities one might
hope for in a kernel facility... IDK. Anyway, the Linux-like
memfd_create() always does it the _DEFAULT way. Either way, we can't
have identical "try" semantics: it'll actually put some effort into
trying, perhaps burning many seconds of CPU.

I took a peek at what we're doing for Windows and the documentation
tells me that it's like that too. I don't recall hearing any complaints
about that, but it's gated on a Windows permission that I assume very
few have enabled, so "try" probably isn't trying on most systems.
Quoting:

"Large-page memory regions may be difficult to obtain after the system
has been running for a long time because the physical space for each
large page must be contiguous, but the memory may have become
fragmented. Allocating large pages under these conditions can
significantly affect system performance. Therefore, applications
should avoid making repeated large-page allocations and instead
allocate all large pages one time, at startup."

For Windows we also interpret "on" with GetLargePageMinimum(), which
sounds like my "second known page size" idea.

To make Windows do the thing that this thread wants, I found a thread
saying that calling VirtualAlloc(..., MEM_RESET) and then convincing
every process to call VirtualUnlock(...) might work:

https://groups.google.com/g/microsoft.public.win32.programmer.kernel/c/3SvznY38SSc/m/4Sx_xwon1vsJ

I'm not sure what to do about the other Unixen. One option is
nothing, no feature, patches welcome. Another is to use
shm_open(<made up name>), like DSM segments, except we never need to
reopen these ones so we could immediately call shm_unlink() to leave
only a very short window to crash and leak a name. It'd be low risk
name pollution in a name space that POSIX forgot to provide any way to
list. The other idea is non-standard madvise tricks, but they seem
far too squishy to be part of a "portable" fallback, if they even work
at all, so I think it would be better not to have the feature than to
rely on that.
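
A rough sketch of the shm_open()-then-shm_unlink() idea, using only POSIX
calls. The helper name create_unnamed_shm() and the "/demo_shm.<pid>.<n>"
naming scheme are made up for illustration and are not what PostgreSQL's
DSM code uses. On older glibc versions this needs -lrt at link time.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * Creates a POSIX shared memory object under a made-up name, unlinks the
 * name straight away, and sizes the object.  The returned descriptor keeps
 * the memory alive; the name only exists between shm_open() and
 * shm_unlink(), which is the short window in which a crash could leak it.
 *
 * Returns a file descriptor, or -1 with errno set.
 */
static int
create_unnamed_shm(off_t size)
{
    char        name[64];
    int         fd = -1;

    assert(size > 0);

    /* Try a few made-up names until one doesn't exist yet. */
    for (int attempt = 0; attempt < 100; attempt++)
    {
        snprintf(name, sizeof(name), "/demo_shm.%ld.%d",
                 (long) getpid(), attempt);
        fd = shm_open(name, O_RDWR | O_CREAT | O_EXCL, 0600);
        if (fd >= 0)
            break;
        if (errno != EEXIST)
            return -1;
    }
    if (fd < 0)
        return -1;

    /* We never reopen by name, so drop the name immediately. */
    if (shm_unlink(name) != 0 || ftruncate(fd, size) != 0)
    {
        int         save_errno = errno;

        close(fd);
        errno = save_errno;
        return -1;
    }
    return fd;
}
```

The descriptor can then be passed to mmap(..., MAP_SHARED, fd, 0) and
inherited by child processes, and a later ftruncate() on it changes the
object's size.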
