Draft for basic NUMA observability

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Subject: Draft for basic NUMA observability
Date: 2025-02-07 14:32:43
Message-ID: CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N=q1w+DiH-696Xw@mail.gmail.com
Lists: pgsql-hackers

As I promised Andres on the Discord hacking server some time ago, I'm
attaching a very brief (and potentially way too rushed) draft of the
first step towards NUMA observability in PostgreSQL, based on his
presentation [0]. It is rough, but it should get us started. The
patches have had barely any testing; treat them as input for
discussion rather than solid code, to shake out what the proper form
of this should be.

Right now it gives:

postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
NOTICE:  os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id | count
--------------+-------
              | 16127
            6 |   256
            1 |     1

Changes since the version posted on Discord:

1. libnuma is now used, to centralize the dependency in the build
process (future proof; it also gives us the opportunity to use e.g.
numa_set_localalloc()). BTW: why is a specific autoconf version (2.69)
required?
2. the per-page get_mempolicy(2) syscall was replaced by a single call
to move_pages(2), by Bertrand
3. enhancement to support huge pages (building on the above), plus
code to reduce the number of pages to query by mapping DB blocks <->
OS memory pages. This part is a bit hard for me and I'm pretty sure it
could be done somewhat better.
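
To make the DB block <-> OS memory page mapping from item 3 concrete,
here is a minimal sketch of the arithmetic. It is not the code from the
attached patch: the function names are hypothetical, the base address is
passed as an integer, and the OS page size is assumed to be a power of
two.

```c
#include <stddef.h>
#include <stdint.h>

/* Round an address down to the start of its OS page.
 * Assumes os_page_size is a power of two. */
static uintptr_t
page_align_down(uintptr_t addr, size_t os_page_size)
{
    return addr & ~((uintptr_t) os_page_size - 1);
}

/* Number of OS pages overlapped by nblocks blocks of blcksz bytes
 * stored contiguously starting at base (hypothetical helper). */
static size_t
os_page_count(uintptr_t base, size_t nblocks, size_t blcksz,
              size_t os_page_size)
{
    uintptr_t first = page_align_down(base, os_page_size);
    uintptr_t last = page_align_down(base + nblocks * blcksz - 1,
                                     os_page_size);

    return (size_t) ((last - first) / os_page_size) + 1;
}

/* Indexes of the first and last OS page touched by block blkno,
 * counted from the OS page that contains base (hypothetical helper). */
static void
block_to_os_pages(uintptr_t base, size_t blkno, size_t blcksz,
                  size_t os_page_size, size_t *first, size_t *last)
{
    uintptr_t region = page_align_down(base, os_page_size);
    uintptr_t start = base + blkno * blcksz;
    uintptr_t end = start + blcksz - 1;    /* last byte of the block */

    *first = (size_t) ((start - region) / os_page_size);
    *last = (size_t) ((end - region) / os_page_size);
}
```

With 16384 blocks of 8kB on 4kB pages and a page-aligned base, this
gives 32768 OS pages, i.e. two per block, which matches the NOTICE
above. With 2MB huge pages the same region needs only 64 pages to be
queried, and a block gets first != last only when the base is not
aligned in a way that keeps blocks inside page boundaries.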

Some other points:
a. there are plenty of FIXMEs inside and I bet I could have screwed up
the void *ptr calculations, but we somehow need to support scenarios
like BLCKSZ=2kB..32kB with OS page sizes of 4kB, 2MB and 16MB
b. I don't think it makes sense to expose users to bitmaps or int[]
arrays, so there's no support showing that potentially 1 DB block
spans 2 OS memory pages (I think it should be rare!)
c. we probably should switch to numa_move_pages(3) from libnuma, right?
d. earlier Andres wrote:
> IME using pg_buffercache_pages() is often too expensive due to the per-row overhead. I think we'd probably want a number-of-pages-per-numa-node function
> that does the grouping in C. Compare how fast pg_buffercache_summary() is to doing the grouping in SQL when using larger shared_buffers settings.
I don't think it makes a lot of sense to introduce a *new*
pg_buffercache_numa_usage_summary() for this if we can go straight
for a pg_shmallocations_numa view instead, does it? It would give a
much better picture of everything else for free.
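
Whichever view it ends up in, the "grouping in C" that Andres describes
could look roughly like the sketch below. It is an illustration only,
not code from the patches: NodeSummary, summarize_nodes() and MAX_NODES
are hypothetical names, and the input is assumed to be a status array
in the shape move_pages(2) fills in when called with nodes = NULL (a
node id >= 0 per page, or a negative errno such as -ENOENT for a page
that is not present).

```c
#include <stddef.h>

#define MAX_NODES 64            /* illustrative cap, an assumption */

typedef struct NodeSummary
{
    size_t per_node[MAX_NODES]; /* pages resident on each NUMA node */
    size_t unknown;             /* negative status or node id out of range */
} NodeSummary;

/* Count OS pages per NUMA node in a single pass over the status array,
 * instead of returning one row per buffer and grouping in SQL. */
static void
summarize_nodes(const int *status, size_t npages, NodeSummary *out)
{
    size_t i;

    for (i = 0; i < MAX_NODES; i++)
        out->per_node[i] = 0;
    out->unknown = 0;

    for (i = 0; i < npages; i++)
    {
        if (status[i] >= 0 && status[i] < MAX_NODES)
            out->per_node[status[i]]++;
        else
            out->unknown++;
    }
}
```

A set-returning function would then emit at most MAX_NODES + 1 rows no
matter how large shared_buffers is, which is where the saving over
per-row output would come from.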

Patches and co-authors are more than welcome!

-J.

[0] - https://anarazel.de/talks/2024-10-23-pgconf-eu-numa-vs-postgresql/numa-vs-postgresql.pdf

Attachment | Content-Type | Size
0001-Extend-pg_buffercache-to-also-show-NUMA-zone-id-allo.patch | application/octet-stream | 10.2 KB
0001-Add-optional-dependency-to-libnuma-for-basic-NUMA-aw.patch | application/octet-stream | 5.1 KB
