remap the .text segment into huge pages at run time

From: John Naylor <john(dot)naylor(at)enterprisedb(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: remap the .text segment into huge pages at run time
Date: 2022-11-02 06:32:37
Message-ID: CAFBsxsHx9z45MfsAjELFiPv_kcgCcH_P5jNa=WaeGxO7HU3mag@mail.gmail.com
Lists: pgsql-hackers

It's been known for a while that Postgres spends a lot of time translating
instruction addresses, and using huge pages in the text segment yields a
substantial performance boost in OLTP workloads [1][2]. The difficulty is,
this normally requires a lot of painstaking work (unless your OS does
superpage promotion, like FreeBSD).

I found an MIT-licensed library "iodlr" from Intel [3] that allows one to
remap the .text segment to huge pages at program start. Attached is a
hackish, Meson-only, "works on my machine" patchset to experiment with this
idea.

0001 adapts the library to our error logging and GUC system. The overview:

- read ELF info to get the start/end addresses of the .text segment
- calculate addresses therein aligned at huge page boundaries
- mmap a temporary region and memcpy the aligned portion of the .text
  segment into it
- mmap a second region at the aligned start address, with huge pages and
  MAP_FIXED
- memcpy the contents back from the temp region and revoke the PROT_WRITE
  bit
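
To make the copy-out / remap / copy-back sequence concrete, here is a minimal sketch of the same pattern. It is not the iodlr code or the code in 0001: the helper name and its parameters are made up for illustration, and it is meant to be tried on an ordinary data region, not on .text. Huge pages are requested only if the caller passes MAP_HUGETLB in extra_flags, which can fail when no huge pages are reserved.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (illustration only): replace the mapping at
 * [start, start + len) with a fresh anonymous mapping while preserving
 * its contents.  start and len must be multiples of the page size used
 * by the new mapping.  extra_flags is OR-ed into the second mmap();
 * final_prot is the protection left on the region afterwards.
 * Returns 0 on success, -1 on failure.
 */
static int
remap_preserving_contents(void *start, size_t len, int extra_flags,
                          int final_prot)
{
    void *tmp;
    void *dst;

    assert(start != NULL && len > 0);

    /* Temporary region that holds a copy of the original bytes. */
    tmp = mmap(NULL, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (tmp == MAP_FAILED)
        return -1;
    memcpy(tmp, start, len);

    /* Map new pages over the same address range. */
    dst = mmap(start, len, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | extra_flags,
               -1, 0);
    if (dst == MAP_FAILED)
    {
        munmap(tmp, len);
        return -1;
    }

    /* Copy the contents back, release the copy, drop write permission. */
    memcpy(dst, tmp, len);
    munmap(tmp, len);
    return mprotect(dst, len, final_prot);
}
```

For code, final_prot would presumably be PROT_READ | PROT_EXEC. The sketch is only safe because the function doing the work is not inside the region being replaced.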

The reason this doesn't "saw off the branch you're standing on" is that the
remapping is done in a function that's forced to live in a different
segment, and doesn't call any non-libc functions living elsewhere:

static void
__attribute__((__section__("lpstub")))
__attribute__((__noinline__))
MoveRegionToLargePages(const mem_range * r, int mmap_flags)
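
As a small illustration of the section trick, the sketch below places a stand-in function in a section named "lpstub" and checks that its address falls inside that section and not in regular .text. The function names are hypothetical. It assumes GCC or Clang on an ELF target linked with GNU ld or lld, which define __start_<name> and __stop_<name> symbols for sections whose names are valid C identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Section bounds provided by the linker (assumption: GNU ld or lld). */
extern const char __start_lpstub[];
extern const char __stop_lpstub[];

/* Hypothetical stand-in for the real remapping function. */
static int
__attribute__((__section__("lpstub")))
__attribute__((__noinline__))
stub_add(int a, int b)
{
    return a + b;
}

/* Returns 1 if addr lies inside the lpstub section, else 0. */
static int
address_in_stub(uintptr_t addr)
{
    return addr >= (uintptr_t) __start_lpstub &&
           addr < (uintptr_t) __stop_lpstub;
}

/* Sanity check: the stub must not live in the ordinary text section. */
static void
verify_stub_placement(void)
{
    assert(address_in_stub((uintptr_t) &stub_add));
    assert(!address_in_stub((uintptr_t) &address_in_stub));
}
```

Note that the attribute only controls where the function itself is placed. Anything it calls that lives in the region being remapped would still be a problem.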

Debug messages show

2022-11-02 12:02:31.064 +07 [26955] DEBUG: .text start: 0x487540
2022-11-02 12:02:31.064 +07 [26955] DEBUG: .text end: 0x96cf12
2022-11-02 12:02:31.064 +07 [26955] DEBUG: aligned .text start: 0x600000
2022-11-02 12:02:31.064 +07 [26955] DEBUG: aligned .text end: 0x800000
2022-11-02 12:02:31.066 +07 [26955] DEBUG: binary mapped to huge pages
2022-11-02 12:02:31.066 +07 [26955] DEBUG: un-mmapping temporary code region

Here, out of 5MB of Postgres text, only one 2MB huge page can be used, but
that single page still stands in for 512 4kB pages in the TLB and might
bring a small improvement. The un-remapped region below 0x600000 contains
the ~600kB of "cold" code, since the linker puts the cold section first, at
least with recent versions of ld and lld.
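
The aligned addresses in the debug output follow from plain rounding arithmetic. The sketch below is not taken from the patch: the function and macro names are made up, and it assumes 2MB huge pages and 4kB base pages. It rounds the start up and the end down to a huge page boundary, then counts the whole huge pages in between.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HUGE_PAGE_SIZE ((uintptr_t) 2 * 1024 * 1024)  /* assumed 2MB */
#define BASE_PAGE_SIZE ((uintptr_t) 4096)             /* assumed 4kB */

/* Round addr up to the next multiple of align (a power of two). */
static uintptr_t
align_up(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr + align - 1) & ~(align - 1);
}

/* Round addr down to the previous multiple of align (a power of two). */
static uintptr_t
align_down(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & ~(align - 1);
}

/* Number of whole huge pages that fit inside [text_start, text_end). */
static size_t
huge_pages_covered(uintptr_t text_start, uintptr_t text_end)
{
    uintptr_t lo = align_up(text_start, HUGE_PAGE_SIZE);
    uintptr_t hi = align_down(text_end, HUGE_PAGE_SIZE);

    return hi > lo ? (size_t) ((hi - lo) / HUGE_PAGE_SIZE) : 0;
}
```

With the values logged above, align_up(0x487540, HUGE_PAGE_SIZE) gives 0x600000, align_down(0x96cf12, HUGE_PAGE_SIZE) gives 0x800000, and huge_pages_covered() returns 1. HUGE_PAGE_SIZE / BASE_PAGE_SIZE is where the figure of 512 comes from.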

0002 is my attempt to force the linker's hand and get the entire text
segment mapped to huge pages. It's quite a finicky hack, and easily broken
(see below). That said, it still builds easily within our normal build
process, and maybe there is a better way to get the effect.

It does two things:

- Pass the linker -Wl,-zcommon-page-size=2097152
  -Wl,-zmax-page-size=2097152, which aligns .init to a 2MB boundary. That's
  done for predictability, but it means the next 2MB boundary is very
  nearly 2MB away.

- Add a "cold" __asm__ filler function that just takes up space, enough to
  push the end of the .text segment over the next aligned boundary, or to
  ~8MB in size.

In a non-assert build:

0001:

$ bloaty inst-perf/bin/postgres

    FILE SIZE        VM SIZE
 --------------  --------------
  53.7%  4.90Mi   58.7%  4.90Mi    .text
  ...
 100.0%  9.12Mi  100.0%  8.35Mi    TOTAL

$ readelf -S --wide inst-perf/bin/postgres

  [Nr] Name   Type      Address          Off    Size   ES Flg Lk Inf Al
  ...
  [12] .init  PROGBITS  0000000000486000 086000 00001b 00  AX  0   0  4
  [13] .plt   PROGBITS  0000000000486020 086020 001520 10  AX  0   0 16
  [14] .text  PROGBITS  0000000000487540 087540 4e59d2 00  AX  0   0 16
  ...

0002:

$ bloaty inst-perf/bin/postgres

    FILE SIZE        VM SIZE
 --------------  --------------
  46.9%  8.00Mi   69.9%  8.00Mi    .text
  ...
 100.0%  17.1Mi  100.0%  11.4Mi    TOTAL

$ readelf -S --wide inst-perf/bin/postgres

  [Nr] Name   Type      Address          Off    Size   ES Flg Lk Inf Al
  ...
  [12] .init  PROGBITS  0000000000600000 200000 00001b 00  AX  0   0  4
  [13] .plt   PROGBITS  0000000000600020 200020 001520 10  AX  0   0 16
  [14] .text  PROGBITS  0000000000601540 201540 7ff512 00  AX  0   0 16
  ...

Debug messages with 0002 show 6MB mapped:

2022-11-02 12:35:28.482 +07 [28530] DEBUG: .text start: 0x601540
2022-11-02 12:35:28.482 +07 [28530] DEBUG: .text end: 0xe00a52
2022-11-02 12:35:28.482 +07 [28530] DEBUG: aligned .text start: 0x800000
2022-11-02 12:35:28.482 +07 [28530] DEBUG: aligned .text end: 0xe00000
2022-11-02 12:35:28.486 +07 [28530] DEBUG: binary mapped to huge pages
2022-11-02 12:35:28.486 +07 [28530] DEBUG: un-mmapping temporary code region

Since the front is all-cold, and there is very little at the end,
practically all hot pages are now remapped. The biggest problem with the
hackish filler function (in addition to maintainability) is that, if
explicit huge pages are turned off in the kernel, attempting mmap() with
MAP_HUGETLB causes complete startup failure if the .text segment is larger
than 8MB. I haven't looked into what's happening there yet, but I didn't
want to get too far into the weeds before getting feedback on whether the
entire approach in this thread is sound enough to justify further work.

[1] https://www.cs.rochester.edu/u/sandhya/papers/ispass19.pdf
(paper: "On the Impact of Instruction Address Translation Overhead")
[2] https://twitter.com/AndresFreundTec/status/1214305610172289024
[3] https://github.com/intel/iodlr

--
John Naylor
EDB: http://www.enterprisedb.com

Attachment Content-Type Size
v1-0002-Put-all-non-cold-.text-in-huge-pages.patch application/x-patch 3.1 KB
v1-0001-Partly-remap-the-.text-segment-into-huge-pages-at.patch application/x-patch 12.7 KB
