Re: remap the .text segment into huge pages at run time

From: Andres Freund <andres(at)anarazel(dot)de>
To: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: remap the .text segment into huge pages at run time
Date: 2022-11-05 08:27:48
Message-ID: 20221105082748.dgb57maldyvvpv6n@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2022-11-05 12:54:18 +0700, John Naylor wrote:
> On Sat, Nov 5, 2022 at 1:33 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > I hacked in a MADV_COLLAPSE (with setarch -R, so that I could just hardcode
> > the address / length), and it seems to work nicely.
> >
> > With the weird caveat that on fs one needs to make sure that the executable
> > doesn't use reflinks to reuse parts of other files, and that the mold linker
> > and cp do... Not a concern on ext4, but on xfs. I took to copying the
> > postgres binary with cp --reflink=never
>
> What happens otherwise? That sounds like a difficult thing to guard against.

MADV_COLLAPSE fails, but otherwise things continue on. I think it's mostly an
issue on dev systems, not on prod systems, because there the files will be
unpacked from a package or such.
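
A minimal sketch of treating that failure as non-fatal. The helper name
try_collapse() is hypothetical, and the fallback to MADV_HUGEPAGE is only an
assumption about what a real version might do; it may have little effect on
older kernels.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Older libc headers may lack these; the values are the Linux UAPI ones. */
#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE 14
#endif
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif

/*
 * Hypothetical helper: try to collapse [addr, addr + len) into huge pages.
 *
 * Returns 0 if MADV_COLLAPSE succeeded, 1 if only the MADV_HUGEPAGE hint
 * was accepted, and -1 if neither worked. Failure is only logged; the
 * caller is expected to carry on with regular pages.
 */
static int
try_collapse(void *addr, size_t len)
{
    if (madvise(addr, len, MADV_COLLAPSE) == 0)
        return 0;
    fprintf(stderr, "MADV_COLLAPSE failed: %s\n", strerror(errno));

    if (madvise(addr, len, MADV_HUGEPAGE) == 0)
        return 1;
    fprintf(stderr, "MADV_HUGEPAGE failed: %s\n", strerror(errno));

    return -1;
}
```

The caller can ignore the return value entirely; it is only there so that a
startup log message can say which path was taken.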

> > On 2022-11-03 10:21:23 -0700, Andres Freund wrote:
> > > > - Add a "cold" __asm__ filler function that just takes up space, enough
> > > > to push the end of the .text segment over the next aligned boundary, or
> > > > to ~8MB in size.
> > >
> > > I don't understand why this is needed - as long as the pages are aligned
> > > to 2MB, why do we need to fill things up on disk? The in-memory contents
> > > are the relevant bit, no?
> >
> > I now assume it's because you either observed the mappings set up by the
> > loader to not include the space between the segments?
>
> My knowledge is not quite that deep. The iodlr repo has an example "hello
> world" program, which links with 8 filler objects, each with 32768
> __attribute((used)) dummy functions. I just cargo-culted that idea and
> simplified it. Interestingly enough, looking through the commit history,
> they used to align the segments via linker flags, but took it out here:
>
> https://github.com/intel/iodlr/pull/25#discussion_r397787559
>
> ...saying "I'm not sure why we added this". :/

That was about using a linker script, not really linker flags though.

I don't think the dummy functions are a good approach; there were still plenty
of things placed after them when I played with it.

> I quickly tried to align the segments with the linker and then in my patch
> have the address for mmap() rounded *down* from the .text start to the
> beginning of that segment. It refused to start without logging an error.

Hm, what linker was that? I did note that you need some additional flags for
some of the linkers.

> > With these flags the "R E" segments all start on a 0x200000/2MiB boundary
> > and are padded to the next 2MiB boundary. However the OS / dynamic loader
> > only maps the necessary part, not all the zero padding.
> >
> > This means that if we were to issue a MADV_COLLAPSE, we can before it do an
> > mremap() to increase the length of the mapping.
>
> I see, interesting. What location are you passing for madvise() and
> mremap()? The beginning of the segment (for me has .init/.plt) or an
> aligned boundary within .text?

I started postgres with setarch -R, looked at /proc/$pid/[s]maps to see the
start/end of the r-xp mapped segment. Here's my hacky code, with a bunch of
comments added.

/* Needs <stdint.h>, <stdio.h>, <sys/mman.h>; mremap() needs _GNU_SOURCE. */
void *addr = (void *) 0x555555800000;
void *end = (void *) 0x555555e09000;
size_t advlen = (uintptr_t) end - (uintptr_t) addr;

/* 2MiB huge page size; has to be a power of two for the mask below */
const size_t bound = 1024*1024*2;
size_t advlen_up = (advlen + bound - 1) & ~(bound - 1);
void *r2;
int r;

/*
 * Increase size of mapping to cover the trailing padding to the next
 * segment. Otherwise all the code in that range can't be put into
 * a huge page (access in the non-mapped range needs to cause a fault,
 * hence can't be in the huge page).
 * XXX: Should probably assert that that space is actually zeroes.
 */
r2 = mremap(addr, advlen, advlen_up, 0);
if (r2 == MAP_FAILED)
    fprintf(stderr, "mremap failed: %m\n");
else if (r2 != addr)
    fprintf(stderr, "mremap wrong addr: %p\n", r2);
else
    advlen = advlen_up;

/*
 * The docs for MADV_COLLAPSE say there should be at least one page
 * in the mapped space "for every eligible hugepage-aligned/sized
 * region to be collapsed". I just forced that. But probably not
 * necessary.
 */
r = madvise(addr, advlen, MADV_WILLNEED);
if (r != 0)
    fprintf(stderr, "MADV_WILLNEED failed: %m\n");

r = madvise(addr, advlen, MADV_POPULATE_READ);
if (r != 0)
    fprintf(stderr, "MADV_POPULATE_READ failed: %m\n");

/*
 * Make huge pages out of it. Requires at least linux 6.1. We could
 * fall back to MADV_HUGEPAGE if it fails, but it doesn't do all that
 * much in older kernels.
 */
#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25
#endif
r = madvise(addr, advlen, MADV_COLLAPSE);
if (r != 0)
    fprintf(stderr, "MADV_COLLAPSE failed: %m\n");
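
The length arithmetic can be checked in isolation. With the hardcoded
addresses above the mapping is 0x609000 bytes long, which rounds up to
0x800000 (8MiB) for 2MiB huge pages. A minimal sketch, where round_up_pow2()
and HUGE_PAGE_SIZE are hypothetical names and the 2MiB size is an assumption
that holds for x86-64:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed huge page size: 2MiB, as on x86-64. */
#define HUGE_PAGE_SIZE ((size_t) 2 * 1024 * 1024)

/*
 * Hypothetical helper: round len up to the next multiple of align.
 * align has to be a power of two, otherwise the mask trick is wrong.
 */
static size_t
round_up_pow2(size_t len, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (len + align - 1) & ~(align - 1);
}
```

A length that is already a multiple of 2MiB is returned unchanged, so no
mremap() would be needed in that case.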

A real version would have to open /proc/self/maps and do this for at least
postgres' r-xp mapping. We could do it for libraries too, if they're suitably
aligned (both in memory and on-disk).
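
A rough sketch of what the /proc/self/maps part could look like. All names
here (MapsEntry, parse_maps_line(), is_collapse_candidate(),
count_collapse_candidates()) are hypothetical, and "suitably aligned" is
interpreted as both the start address and the file offset being multiples of
2MiB, which is an assumption rather than something verified against the
kernel's requirements.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Assumed huge page size: 2MiB, as on x86-64. */
#define HUGE_PAGE_SIZE ((uint64_t) 2 * 1024 * 1024)

/* One line of /proc/<pid>/maps: start-end perms offset dev inode path */
typedef struct MapsEntry
{
    uint64_t    start;
    uint64_t    end;
    uint64_t    offset;
    char        perms[5];
    char        path[4096];
} MapsEntry;

/* Parse a single maps line. Returns 1 on success, 0 if it doesn't parse. */
static int
parse_maps_line(const char *line, MapsEntry *e)
{
    int         n;

    assert(line != NULL && e != NULL);
    e->path[0] = '\0';          /* anonymous mappings have no path */
    n = sscanf(line,
               "%" SCNx64 "-%" SCNx64 " %4s %" SCNx64 " %*s %*s %4095[^\n]",
               &e->start, &e->end, e->perms, &e->offset, e->path);
    return n >= 4;
}

/* File-backed r-xp mapping, aligned in memory and in the file? */
static int
is_collapse_candidate(const MapsEntry *e)
{
    return strcmp(e->perms, "r-xp") == 0 &&
        e->path[0] == '/' &&
        e->start % HUGE_PAGE_SIZE == 0 &&
        e->offset % HUGE_PAGE_SIZE == 0;
}

/* Print and count candidate mappings; returns -1 if the file can't be opened. */
static int
count_collapse_candidates(const char *maps_path)
{
    FILE       *f = fopen(maps_path, "r");
    char        line[8192];
    MapsEntry   e;
    int         count = 0;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL)
    {
        if (parse_maps_line(line, &e) && is_collapse_candidate(&e))
        {
            printf("%" PRIx64 "-%" PRIx64 " %s\n", e.start, e.end, e.path);
            count++;
        }
    }
    fclose(f);
    return count;
}
```

Calling count_collapse_candidates("/proc/self/maps") early during startup
would list the ranges that the mremap() / madvise() sequence above could then
be applied to, instead of the hardcoded addresses.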

Greetings,

Andres Freund
