Re: Draft for basic NUMA observability

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Draft for basic NUMA observability
Date: 2025-03-12 15:41:15
Message-ID: CAKZiRmzwO_KEXH4hkrHmx3qja0kER9pKW=3yyLvEP+b38vVLQg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 10, 2025 at 11:14 AM Bertrand Drouvot
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:

> Thanks for the new version!

v10 is attached with most fixes after review, plus one new thing:
pg_numa_available(), for run-time decisions inside the tests, which
became necessary after simplifying the code a little as you wanted.
I've also fixed -Dlibnuma=disabled, as it was not properly implemented.
There are a couple of open items (minor/decision things), but most are
fixed or answered:

> Some random comments on 0001:
>
> === 1
>
> It does not compiles "alone". It's missing:
>
[..]
> +extern PGDLLIMPORT int huge_pages_status;
[..]
> -static int huge_pages_status = HUGE_PAGES_UNKNOWN;
> +int huge_pages_status = HUGE_PAGES_UNKNOWN;
>
> That come with 0002. So it has to be in 0001 instead.

Ugh, good catch, I hadn't thought about it in isolation. They are kept
separate just to ease review, but should be committed together. Fixed.

> === 2
>
> +else
> + as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
> +fi
>
> Maybe we should add the same test (checking for numa.h) for meson?

TBH, I have no idea; numa.h may exist but it may not link, e.g. when
cross-compiling 32-bit on 64-bit. Or is this more about keeping meson
and autoconf in sync?

> === 3
>
> +# FIXME: filter-out / with/without with_libnuma?
> +LIBS += $(LIBNUMA_LIBS)
>
> It looks to me that we can remove those 2 lines.

Done.

> === 4
>
> + Only supported on Linux.
>
> s/on Linux/on platforms for which the libnuma library is implemented/?
>
> I did a quick grep on "Only supported on" and it looks like that could be
> a more consistent wording.

Fixed.

> === 5
>
> +#include "c.h"
> +#include "postgres.h"
> +#include "port/pg_numa.h"
> +#include "storage/pg_shmem.h"
> +#include <unistd.h>
>
> I had a closer look to other header files and it looks like it "should" be:
>
> #include "c.h"
> #include "postgres.h"
>
> #include <unistd.h>
> #ifdef WIN32
> #include <windows.h>
> #endif
>
> #include "port/pg_numa.h"
> #include "storage/pg_shmem.h"
>
> And is "#include "c.h"" really needed?

Fixed both. It seems to compile without c.h.

> === 6
>
> +/* FIXME not tested, might crash */
>
> That's a bit scary.

When you have been in support long enough it becomes the norm ;) But
on a serious note, Andres wanted to have NUMA error/warning handlers
(used by libnuma), but the current code has no realistic code path to
hit them from numa_available(3), numa_move_pages(3) or
numa_max_node(3). The situation won't change until one day in the
future (I hope!) we start using some more advanced libnuma
functionality for interleaving memory; please see my earlier reply:
https://www.postgresql.org/message-id/CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA%2Bs%2B_5hcTOg%40mail.gmail.com
E.g. numa_available(3) is a tiny wrapper, see
https://github.com/numactl/numactl/blob/master/libnuma.c#L872

For now, I've adjusted that FIXME to XXX, but I still don't know how
we could inject a failure to see this triggered...
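
To illustrate what such handlers could look like (just a sketch, not
necessarily what v10 does): libnuma exposes weak numa_error()/numa_warn()
symbols that a program may override; the message wording and the use of
ereport() below are my assumptions:

#include "postgres.h"
#include <numa.h>

void
numa_warn(int num, char *fmt,...)
{
	char		buf[1024];
	va_list		ap;

	va_start(ap, fmt);
	vsnprintf(buf, sizeof(buf), fmt, ap);
	va_end(ap);

	ereport(WARNING,
			(errmsg_internal("libnuma (%d): %s", num, buf)));
}

void
numa_error(char *where)
{
	ereport(ERROR,
			(errmsg_internal("libnuma failure: %s", where)));
}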

> === 7
>
> + * XXX: for now we issue just WARNING, but long-term that might depend on
> + * numa_set_strict() here
>
> s/here/here./

Done.

> Did not look carefully at all the comments in 0001, 0002 and 0003 though.
>
> A few random comments regarding 0002:
>
> === 8
>
> # create extension pg_buffercache;
> ERROR: could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so": /home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so: undefined symbol: pg_numa_query_pages
> CONTEXT: SQL statement "CREATE FUNCTION pg_buffercache_pages()
> RETURNS SETOF RECORD
> AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
> LANGUAGE C PARALLEL SAFE"
> extension script file "pg_buffercache--1.2.sql", near line 7
>
> While that's ok if 0003 is applied. I think that each individual patch should
> "fully" work individually.

STILL OPEN QUESTION: Not sure I understand: you need to have 0001 +
0002 or 0001 + 0003, but here 0002 is complaining about the lack of
pg_numa_query_pages(), which is part of 0001(?). Should I merge those
patches or keep them separate?

> === 9
>
> + do
> + {
> + if (os_page_ptrs[blk2page + j] == 0)
>
> blk2page + j will be repeated multiple times, better to store it in a local
> variable instead?

Done.

> === 10
>
> + if (firstUseInBackend == true)
>
>
> if (firstUseInBackend) instead?

Done everywhere.

> === 11
>
> + int j = 0,
> + blk2page = (int) i * pages_per_blk;
>
> I wonder if size_t is more appropriate for blk2page:
>
> size_t blk2page = (size_t)(i * pages_per_blk) maybe?

Sure, done.

> === 12
>
> as we know that we'll iterate until pages_per_blk then would a for loop be more
> appropriate here, something like?
>
> "
> for (size_t j = 0; j < pages_per_blk; j++)
> {
> if (os_page_ptrs[blk2page + j] == 0)
> {
> "

Sure.
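
Taken together, 9, 11 and 12 make that spot look roughly like this
(only a sketch; the loop body and surrounding declarations are
assumptions, the names come from the snippets quoted above):

	size_t		blk2page = (size_t) (i * pages_per_blk);

	for (size_t j = 0; j < pages_per_blk; j++)
	{
		size_t		idx = blk2page + j;	/* hoisted index, per === 9 */

		if (os_page_ptrs[idx] == 0)
		{
			/* first use in this backend: record the pointer and touch
			 * the page (details omitted in this sketch) */
		}
	}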

> === 13
>
> + if (buf_state & BM_DIRTY)
> + fctx->record[i].isdirty = true;
> + else
> + fctx->record[i].isdirty = false;
>
> fctx->record[i].isdirty = (buf_state & BM_DIRTY) != 0 maybe?
>
> === 14
>
> + if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
> + fctx->record[i].isvalid = true;
> + else
> + fctx->record[i].isvalid = false;
>
> fctx->record[i].isvalid = ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
> maybe?

It comes from the original pg_buffercache, and the suggested form is
less readable to me, so I don't want to touch it too much because
that might open the refactoring doors too wide.

> === 15
>
> +populate_buffercache_entry(int i, BufferCachePagesContext *fctx)
>
> I wonder if "i" should get a more descriptive name?

Done, buffer_id.

> === 16
>
> s/populate_buffercache_entry/pg_buffercache_build_tuple/? (to be consistent
> with pg_stat_io_build_tuples for example).
>
> I now realize that I did propose populate_buffercache_entry() up-thread,
> sorry for changing my mind.

OK, I tried to come up with a better name, but all my ideas were too
long. pg_buffercache_build_tuple sounds nice, so let's use that.

> === 17
>
> + static bool firstUseInBackend = true;
>
> maybe we should give it a more descriptive name?

I couldn't come up with anything that wouldn't be too long, so
instead I've added a comment explaining the meaning of this variable;
I hope that's good enough.
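
Something along these lines (the exact wording in v10 may differ;
this is just a sketch of the intent):

	/*
	 * Set to true until this backend has touched (paged in) the shared
	 * buffers once; fork() does not seem to carry the postmaster's
	 * page-table entries over, so without that initial touch
	 * pg_numa_query_pages() cannot resolve the NUMA node for those pages.
	 */
	static bool firstUseInBackend = true;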

> Also I need to think more about how firstUseInBackend is used, for example:
>
> ==== 17.1
>
> would it be better to define it as a file scope variable? (at least that
> would avoid to pass it as an extra parameter in some functions).

I have no strong opinion on this, but I have one doubt (for the
future): wouldn't creating global variables make life harder for the
upcoming multithreading folks?

> === 17.2
>
> what happens if firstUseInBackend is set to false and later on the pages
> are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is
> called again by a backend that already had set firstUseInBackend to false,
> would that provide accurate information?

It still gives the correct result. That "touching" (paging-in) is
probably only necessary to properly resolve the PTEs, as fork() does
not seem to carry them over from the parent:
postgres=# select 'create table foo' || s || ' as select
generate_series(1, 100000);' from generate_series(1, 4) s;
postgres=# \gexec
SELECT 100000
SELECT 100000
SELECT 100000
SELECT 100000
postgres=# select numa_zone_id, count(*) from pg_buffercache_numa
group by numa_zone_id order by numa_zone_id; -- before:
 numa_zone_id |  count
--------------+---------
            0 |     256
            1 |    4131
              | 8384221
postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
           1451

-- now use another root (!) session to run migratepages(8) from
numactl, which will also shift the shm segment:
# migratepages 1451 1 3

-- the above will be in progress for quite some time, but the outcome
is visible much sooner in that backend (pg keeps functioning):
postgres=# select numa_zone_id, count(*) from pg_buffercache_numa
group by numa_zone_id order by numa_zone_id;
 numa_zone_id |  count
--------------+---------
            0 |     256
            3 |    4480
              | 8383872

So the above clearly shows that the initial touching of shm is
required, but just once; the result stays valid afterwards.
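
For completeness, "touching" here just means reading one byte from
every OS page backing the buffers, so the kernel materializes the PTE
before pg_numa_query_pages() is called. A minimal sketch (not the
patch's actual code; all names are assumptions):

static void
touch_os_pages(char *startptr, size_t pages_per_blk, Size os_page_size)
{
	for (size_t j = 0; j < pages_per_blk; j++)
	{
		volatile char *ptr = startptr + j * os_page_size;

		/* a plain read is enough to fault the page in for this process */
		(void) *ptr;
	}
}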

BTW: migratepages(8) was stuck there for 1-2 minutes on
"__wait_rcu_gp" according to its wchan, without any sign of activity
on the OS, and then out of the blue completed just fine (s_b=64GB, HP=on).

> === 18
>
> +Datum
> +pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
> +{
> + FuncCallContext *funcctx;
> + BufferCachePagesContext *fctx; /* User function context. */
> + bool query_numa = true;
>
> I don't think we need query_numa anymore in pg_buffercache_numa_pages().
>
> I think that we can just ERROR (or NOTICE and return) here:
>
> + if (pg_numa_init() == -1)
> + {
> + elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
> + query_numa = false;
>
> and fully get rid of query_numa.

Right... I fixed it here, but that made the tests blow up, so I had
to find a way to conditionally launch tests based on NUMA
availability, and that's how pg_numa_available() was born. It's in
ipc/shmem.c because I couldn't find a better place for it...
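
It boils down to something like this (a minimal sketch, assuming the
function just exposes the pg_numa_init() result to SQL; the actual
v10 body may differ):

Datum
pg_numa_available(PG_FUNCTION_ARGS)
{
	PG_RETURN_BOOL(pg_numa_init() != -1);
}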

> === 19
>
> And then here:
>
> for (i = 0; i < NBuffers; i++)
> {
> populate_buffercache_entry(i, fctx);
> if (query_numa)
> pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend);
> }
>
> if (query_numa)
> {
> if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
> elog(ERROR, "failed NUMA pages inquiry: %m");
>
> for (i = 0; i < NBuffers; i++)
> {
> int blk2page = (int) i * pages_per_blk;
>
> /*
> * Technically we can get errors too here and pass that to
> * user. Also we could somehow report single DB block spanning
> * more than one NUMA zone, but it should be rare.
> */
> fctx->record[i].numa_zone_id = os_pages_status[blk2page];
> }
>
> maybe we can just loop a single time over "for (i = 0; i < NBuffers; i++)"?

Well, pg_buffercache_numa_prepare_ptrs() is just an inlined wrapper
that prepares **os_page_ptrs, which is then used by
pg_numa_query_pages(), after which we fill in the data. But now,
after removing query_numa, it reads more smoothly anyway. Can you
please take another look at this? Is it up to the project standards?
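
For reference, the flow without query_numa now looks roughly like
this (a sketch based on the snippet you quoted, with the rename from
=== 16 applied; details are assumptions):

	for (i = 0; i < NBuffers; i++)
	{
		pg_buffercache_build_tuple(i, fctx);
		pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size,
										 os_page_ptrs, firstUseInBackend);
	}

	if (pg_numa_query_pages(0, os_page_count, os_page_ptrs,
							os_pages_status) == -1)
		elog(ERROR, "failed NUMA pages inquiry: %m");

	for (i = 0; i < NBuffers; i++)
	{
		size_t		blk2page = (size_t) (i * pages_per_blk);

		fctx->record[i].numa_zone_id = os_pages_status[blk2page];
	}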

> A few random comments regarding 0003:
>
> === 20
>
> + The <structname>pg_shmem_allocations</structname> view shows NUMA nodes
>
> s/pg_shmem_allocations/pg_shmem_numa_allocations/?

Fixed.

> === 21
>
> + /* FIXME: should we release LWlock here ? */
> + elog(ERROR, "failed NUMA pages inquiry status: %m");
>
> There is no need, see src/backend/storage/lmgr/README.

Thanks, fixed.

> === 22
>
> +#define MAX_NUMA_ZONES 32 /* FIXME? */
> + Size zones[MAX_NUMA_ZONES];
>
> could we rely on pg_numa_get_max_node() instead?

Sure, done.
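
Roughly (a sketch; the allocation context and variable names are
assumptions):

	int			max_nodes = pg_numa_get_max_node();
	Size	   *zones = palloc0((max_nodes + 1) * sizeof(Size));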

> === 23
>
> + if (s >= 0)
> + zones[s]++;
>
> should we also check that s is below a limit?

STILL OPEN QUESTION: I'm not sure there is any value in reporting
e.g. -2 per shm entry / per NUMA node entry, is there? If there is,
we could somehow overallocate that array and remember the negative
ones too.
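
For the simple variant, the guard being discussed would just be (a
sketch, assuming a max_nodes variable as in the sketch above;
negative statuses, i.e. move_pages() error codes, are skipped rather
than reported):

	if (s >= 0 && s <= max_nodes)
		zones[s]++;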

> === 24
>
> Regarding how we make use of pg_numa_get_max_node(), are we sure there is
> no possible holes? I mean could a system have node 0,1 and 3 but not 2?

I have never seen a hole in the numbering, as it is established
during boot, and the only thing that could come close to adjusting it
would be taking processor books (CPU and RAM together) offline. Even
*if* someone were doing preventive hardware maintenance, it still
wouldn't hurt, as we are just using the id of the zone to display it
-- the max would already be higher. Technically one could use
lsmem(1) (it mentions a removable flag) after letting pages and
processes migrate away from there, then use chcpu(1) to --disable the
CPUs in that zone (to prevent new processes from landing there and
allocating new local memory), and then offline that memory region via
chmem(1) --disable. Even with all of that, it still shouldn't cause
issues for this code, I think, because `numa_max_node()` says it
`returns the >>highest node number<< available on the current system.`

> Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I
> did a little in 0001 though)

That was an attempt to say "Thank You". OK, I've aligned it that way,
so you are still referenced in 0001. I wouldn't find the motivation
to work on this if you didn't respond to these emails ;)

There is one known issue that CI returned for the numa test, which I
need to get to the bottom of (it does not reproduce for me locally),
in numa.out/sql:

SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations
WARNING: detected write past chunk end in ExprContext 0x55bb7f1a8f90
WARNING: detected write past chunk end in ExprContext 0x55bb7f1a8f90
WARNING: detected write past chunk end in ExprContext 0x55bb7f1a8f90

-J.

Attachment Content-Type Size
v10-0001-Add-optional-dependency-to-libnuma-Linux-only-fo.patch application/octet-stream 19.0 KB
v10-0002-Extend-pg_buffercache-with-new-view-pg_buffercac.patch application/octet-stream 27.8 KB
v10-0003-Add-pg_shmem_numa_allocations-to-show-NUMA-zones.patch application/octet-stream 15.4 KB
