From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Draft for basic NUMA observability
Date: 2025-03-13 09:22:25
Message-ID: CAKZiRmw+EtapdJrageWrMhn35=cP2XZD4JK6ox_gpHnKsbG-Mg@mail.gmail.com
Lists: pgsql-hackers
On Wed, Mar 12, 2025 at 4:41 PM Jakub Wartak
<jakub(dot)wartak(at)enterprisedb(dot)com> wrote:
>
> On Mon, Mar 10, 2025 at 11:14 AM Bertrand Drouvot
> <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
> > Thanks for the new version!
>
> v10 is attached with most fixes from the review and one new thing
> introduced: pg_numa_available(), for run-time decisions inside tests,
> which became necessary after simplifying the code a little as you
> wanted. I've also fixed -Dlibnuma=disabled, as it was not properly
> implemented. There are a couple of open items (minor/decision things),
> but most are fixed or answered:
>
> > Some random comments on 0001:
> >
> > === 1
> >
> > It does not compile "alone". It's missing:
> >
> [..]
> > +extern PGDLLIMPORT int huge_pages_status;
> [..]
> > -static int huge_pages_status = HUGE_PAGES_UNKNOWN;
> > +int huge_pages_status = HUGE_PAGES_UNKNOWN;
> >
> > That comes with 0002. So it has to be in 0001 instead.
>
> Ugh, good catch, I hadn't thought about it in isolation; they are kept
> separate just to ease review, but should be committed together. Fixed.
>
> > === 2
> >
> > +else
> > + as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
> > +fi
> >
> > Maybe we should add the same test (checking for numa.h) for meson?
>
> TBH, I have no idea; numa.h may exist but the library may not link,
> e.g. when cross-compiling 32-bit on 64-bit. Or is this more about
> keeping meson and autoconf in sync?
>
> > === 3
> >
> > +# FIXME: filter-out / with/without with_libnuma?
> > +LIBS += $(LIBNUMA_LIBS)
> >
> > It looks to me that we can remove those 2 lines.
>
> Done.
>
> > === 4
> >
> > + Only supported on Linux.
> >
> > s/on Linux/on platforms for which the libnuma library is implemented/?
> >
> > I did a quick grep on "Only supported on" and it looks like that could be
> > a more consistent wording.
>
> Fixed.
>
> > === 5
> >
> > +#include "c.h"
> > +#include "postgres.h"
> > +#include "port/pg_numa.h"
> > +#include "storage/pg_shmem.h"
> > +#include <unistd.h>
> >
> > I had a closer look at other header files and it looks like it "should" be:
> >
> > #include "c.h"
> > #include "postgres.h"
> >
> > #include <unistd.h>
> > #ifdef WIN32
> > #include <windows.h>
> > #endif
> >
> > #include "port/pg_numa.h"
> > #include "storage/pg_shmem.h"
> >
> > And is "#include "c.h"" really needed?
>
> Fixed both. It seems to compile without c.h.
>
> > === 6
> >
> > +/* FIXME not tested, might crash */
> >
> > That's a bit scary.
>
> When you've been in support long enough, it becomes the norm ;) But on
> a serious note, Andres wanted to have NUMA error/warning handlers (used
> by libnuma), but the current code has no realistic code path that could
> hit them from numa_available(3), numa_move_pages(3) or numa_max_node(3).
> That won't change until, one day in the future (I hope!), we start
> using more advanced libnuma functionality for interleaving memory;
> please see my earlier reply:
> https://www.postgresql.org/message-id/CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA%2Bs%2B_5hcTOg%40mail.gmail.com
> E.g. numa_available(3) is a tiny wrapper, see
> https://github.com/numactl/numactl/blob/master/libnuma.c#L872
>
> For now, I've changed that FIXME to XXX, but I still don't know how we
> could inject a failure to see this triggered...
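>
> To give an idea of what such handlers could look like (just a sketch,
> not the patch's exact code; numa_warn() and numa_error() are the
> override hooks that libnuma documents for applications, and this
> assumes the usual postgres.h / <stdarg.h> includes):
>
> void
> numa_warn(int num, char *fmt, ...)
> {
>     char        buf[1024];
>     va_list     ap;
>
>     va_start(ap, fmt);
>     vsnprintf(buf, sizeof(buf), fmt, ap);
>     va_end(ap);
>
>     /* surface libnuma's warning through the backend's logging */
>     elog(WARNING, "libnuma warning (%d): %s", num, buf);
> }
>
> void
> numa_error(char *where)
> {
>     /* called by libnuma on internal failures */
>     elog(WARNING, "libnuma failure in %s", where);
> }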
>
> > === 7
> >
> > + * XXX: for now we issue just WARNING, but long-term that might depend on
> > + * numa_set_strict() here
> >
> > s/here/here./
>
> Done.
>
> > Did not look carefully at all the comments in 0001, 0002 and 0003 though.
> >
> > A few random comments regarding 0002:
> >
> > === 8
> >
> > # create extension pg_buffercache;
> > ERROR: could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so": /home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so: undefined symbol: pg_numa_query_pages
> > CONTEXT: SQL statement "CREATE FUNCTION pg_buffercache_pages()
> > RETURNS SETOF RECORD
> > AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
> > LANGUAGE C PARALLEL SAFE"
> > extension script file "pg_buffercache--1.2.sql", near line 7
> >
> > While that's OK if 0003 is applied, I think that each individual patch should
> > "fully" work individually.
>
> STILL OPEN QUESTION: Not sure I understand: you need to have 0001 +
> 0002 or 0001 + 0003, but here 0002 is complaining about the lack of
> pg_numa_query_pages(), which is part of 0001 (?). Should I merge those
> patches or keep them separate?
>
> > === 9
> >
> > + do
> > + {
> > + if (os_page_ptrs[blk2page + j] == 0)
> >
> > blk2page + j will be repeated multiple times, better to store it in a local
> > variable instead?
>
> Done.
>
> > === 10
> >
> > + if (firstUseInBackend == true)
> >
> >
> > if (firstUseInBackend) instead?
>
> Done everywhere.
>
> > === 11
> >
> > + int j = 0,
> > + blk2page = (int) i * pages_per_blk;
> >
> > I wonder if size_t is more appropriate for blk2page:
> >
> > size_t blk2page = (size_t)(i * pages_per_blk) maybe?
>
> Sure, done.
>
> > === 12
> >
> > As we know that we'll iterate up to pages_per_blk, would a for loop be more
> > appropriate here? Something like:
> >
> > "
> > for (size_t j = 0; j < pages_per_blk; j++)
> > {
> > if (os_page_ptrs[blk2page + j] == 0)
> > {
> > "
>
> Sure.
>
> > === 13
> >
> > + if (buf_state & BM_DIRTY)
> > + fctx->record[i].isdirty = true;
> > + else
> > + fctx->record[i].isdirty = false;
> >
> > fctx->record[i].isdirty = (buf_state & BM_DIRTY) != 0 maybe?
> >
> > === 14
> >
> > + if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
> > + fctx->record[i].isvalid = true;
> > + else
> > + fctx->record[i].isvalid = false;
> >
> > fctx->record[i].isvalid = ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
> > maybe?
>
> It comes from the original pg_buffercache, and the suggested form is
> less readable to me, so I don't want to touch that too much, because
> that might open the refactoring doors too wide.
>
> > === 15
> >
> > +populate_buffercache_entry(int i, BufferCachePagesContext *fctx)
> >
> > I wonder if "i" should get a more descriptive name?
>
> Done, buffer_id.
>
> > === 16
> >
> > s/populate_buffercache_entry/pg_buffercache_build_tuple/? (to be consistent
> > with pg_stat_io_build_tuples for example).
> >
> > I now realize that I did propose populate_buffercache_entry() up-thread,
> > sorry for changing my mind.
>
> OK, I tried to come up with a better name, but all my ideas were too
> long; pg_buffercache_build_tuple sounds nice, so let's use that.
>
> > === 17
> >
> > + static bool firstUseInBackend = true;
> >
> > maybe we should give it a more descriptive name?
>
> I couldn't come up with anything that wouldn't look too long, so
> instead I've added a comment explaining the meaning behind this
> variable; I hope that's good enough.
>
> > Also I need to think more about how firstUseInBackend is used, for example:
> >
> > ==== 17.1
> >
> > would it be better to define it as a file-scope variable? (at least that
> > would avoid passing it as an extra parameter to some functions).
>
> I have no strong opinion on this, but I have one doubt (for the
> future): wouldn't creating global variables make life harder for the
> upcoming multithreading folks?
>
> > === 17.2
> >
> > what happens if firstUseInBackend is set to false and later on the pages
> > are moved to different NUMA nodes? Then, when pg_buffercache_numa_pages() is
> > called again by a backend that has already set firstUseInBackend to false,
> > would that provide accurate information?
>
> It still gives the correct result. That "touching" (paging-in) is
> probably only necessary to properly resolve the PTEs, as fork() does
> not seem to carry them over from the parent:
>
> postgres=# select 'create table foo' || s || ' as select
> generate_series(1, 100000);' from generate_series(1, 4) s;
> postgres=# \gexec
> SELECT 100000
> SELECT 100000
> SELECT 100000
> SELECT 100000
> postgres=# select numa_zone_id, count(*) from pg_buffercache_numa
> group by numa_zone_id order by numa_zone_id; -- before:
>  numa_zone_id |  count
> --------------+---------
>             0 |     256
>             1 |    4131
>               | 8384221
> postgres=# select pg_backend_pid();
>  pg_backend_pid
> ----------------
>            1451
>
> -- now use another root (!) session to run migratepages(8) from
> numactl, which will also shift the shm segment:
> # migratepages 1451 1 3
>
> -- the above stays in progress for quite some time, but the outcome is
> visible much sooner in that backend (pg keeps functioning):
> postgres=# select numa_zone_id, count(*) from pg_buffercache_numa
> group by numa_zone_id order by numa_zone_id;
>  numa_zone_id |  count
> --------------+---------
>             0 |     256
>             3 |    4480
>               | 8383872
>
> So the above clearly shows that the initial touching of shm is
> required, but only once, and it stays valid afterwards.
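>
> In rough pseudo-C, the idea boils down to something like this (a
> sketch only; the helper name and variables are made up for
> illustration, not the patch's actual code):
>
> static void
> touch_pages_once(void **page_ptrs, size_t nr_pages)
> {
>     static bool already_touched = false;
>
>     if (already_touched)
>         return;
>
>     for (size_t i = 0; i < nr_pages; i++)
>     {
>         /*
>          * A single volatile read faults the page into this backend's
>          * page tables, so that the later move_pages() status query can
>          * resolve which NUMA node backs it.
>          */
>         volatile char c = *(volatile char *) page_ptrs[i];
>
>         (void) c;
>     }
>
>     already_touched = true;
> }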
>
> BTW: migratepages(8) was stuck there for 1-2 minutes on "__wait_rcu_gp"
> according to its wchan, without any sign of activity on the OS, and
> then out of the blue it completed just fine; s_b=64GB, HP=on.
>
> > === 18
> >
> > +Datum
> > +pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
> > +{
> > + FuncCallContext *funcctx;
> > + BufferCachePagesContext *fctx; /* User function context. */
> > + bool query_numa = true;
> >
> > I don't think we need query_numa anymore in pg_buffercache_numa_pages().
> >
> > I think that we can just ERROR (or NOTICE and return) here:
> >
> > + if (pg_numa_init() == -1)
> > + {
> > + elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
> > + query_numa = false;
> >
> > and fully get rid of query_numa.
>
> Right... I fixed it here, but that made the tests blow up, so I had to
> find a way to conditionally launch tests based on NUMA availability,
> and that's how pg_numa_available() was born. It's in ipc/shmem.c
> because I couldn't find a better place for it...
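>
> In its current shape it is basically a trivial wrapper over
> pg_numa_init(), roughly along these lines (a sketch; the SQL-callable
> plumbing may differ in the actual patch):
>
> Datum
> pg_numa_available(PG_FUNCTION_ARGS)
> {
>     /* report at run time whether libnuma initialized successfully */
>     PG_RETURN_BOOL(pg_numa_init() != -1);
> }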
>
> > === 19
> >
> > And then here:
> >
> > for (i = 0; i < NBuffers; i++)
> > {
> > populate_buffercache_entry(i, fctx);
> > if (query_numa)
> > pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend);
> > }
> >
> > if (query_numa)
> > {
> > if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
> > elog(ERROR, "failed NUMA pages inquiry: %m");
> >
> > for (i = 0; i < NBuffers; i++)
> > {
> > int blk2page = (int) i * pages_per_blk;
> >
> > /*
> > * Technically we can get errors too here and pass that to
> > * user. Also we could somehow report single DB block spanning
> > * more than one NUMA zone, but it should be rare.
> > */
> > fctx->record[i].numa_zone_id = os_pages_status[blk2page];
> > }
> >
> > maybe we can just loop a single time over "for (i = 0; i < NBuffers; i++)"?
>
> Well, pg_buffercache_numa_prepare_ptrs() is just an inlined wrapper
> that prepares **os_page_ptrs, which is then used by
> pg_numa_query_pages(), after which we fill in the data. But now, after
> removing query_numa, it reads smoother anyway. Can you please take
> another look at this -- is it up to the project standards?
>
> > A few random comments regarding 0003:
> >
> > === 20
> >
> > + The <structname>pg_shmem_allocations</structname> view shows NUMA nodes
> >
> > s/pg_shmem_allocations/pg_shmem_numa_allocations/?
>
> Fixed.
>
> > === 21
> >
> > + /* FIXME: should we release LWlock here ? */
> > + elog(ERROR, "failed NUMA pages inquiry status: %m");
> >
> > There is no need, see src/backend/storage/lmgr/README.
>
> Thanks, fixed.
>
> > === 22
> >
> > +#define MAX_NUMA_ZONES 32 /* FIXME? */
> > + Size zones[MAX_NUMA_ZONES];
> >
> > could we rely on pg_numa_get_max_node() instead?
>
> Sure, done.
>
> > === 23
> >
> > + if (s >= 0)
> > + zones[s]++;
> >
> > should we also check that s is below a limit?
>
> STILL OPEN QUESTION: I'm not sure it would give us any value to report
> e.g. -2 per shm entry/per NUMA node entry, would it? If it would, we
> could somehow overallocate that array and remember the negative ones too.
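>
> Something along these lines would cover both points (a sketch only;
> shm_page_count, pages_status and node_pages are illustrative names,
> and whether to also count/report the negative statuses is the open
> question above):
>
> int         max_node = pg_numa_get_max_node();
> Size       *node_pages = palloc0(sizeof(Size) * (max_node + 1));
>
> for (size_t i = 0; i < shm_page_count; i++)
> {
>     int         s = pages_status[i];
>
>     /*
>      * Negative values are per-page errors from the pages inquiry
>      * (e.g. -EFAULT); count only statuses within [0, max_node].
>      */
>     if (s >= 0 && s <= max_node)
>         node_pages[s]++;
> }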
>
> > === 24
> >
> > Regarding how we make use of pg_numa_get_max_node(), are we sure there are
> > no possible holes? I mean, could a system have nodes 0, 1 and 3 but not 2?
>
> I have never seen a hole in the numbering, as it is established during
> boot, and the only thing that could come close to adjusting it would be
> taking processor books (CPU and RAM together) offline. Even *if*
> someone were doing some preventive hardware maintenance, that still
> wouldn't hurt, as we only use the id of the zone to display it -- the
> max would already be higher. I mean, technically one could use lsmem(1)
> (it mentions a removable flag, once, say, pages and processes have
> migrated away from there), then use chcpu(1) to --disable the CPUs in
> that zone (to prevent new processes from landing there and allocating
> new local memory), and then offline that memory region via chmem(1)
> --disable. Even with all of that, it still shouldn't cause issues for
> this code, I think, because `numa_max_node()` says it `returns
> the >> highest node number <<< available on the current system.`
>
> > Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I
> > did a little in 0001 though)
>
> That was an attempt to say "thank you". OK, I've adjusted it that way,
> so you are still referenced in 0001. I wouldn't find the motivation to
> work on this if you didn't respond to these emails ;)
>
> There is one known issue that CI reported for the numa test, which I
> need to get to the bottom of (it does not reproduce for me locally),
> in numa.out/sql:
>
> SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_numa_allocations
> WARNING: detected write past chunk end in ExprContext 0x55bb7f1a8f90
> WARNING: detected write past chunk end in ExprContext 0x55bb7f1a8f90
> WARNING: detected write past chunk end in ExprContext 0x55bb7f1a8f90
>
Hi, ok, so v11 should fix this last problem and make cfbot happy.
-J.
Attachments:
  v11-0002-Extend-pg_buffercache-with-new-view-pg_buffercac.patch (application/octet-stream, 27.8 KB)
  v11-0003-Add-pg_shmem_numa_allocations-to-show-NUMA-zones.patch (application/octet-stream, 15.4 KB)
  v11-0001-Add-optional-dependency-to-libnuma-Linux-only-fo.patch (application/octet-stream, 19.0 KB)