Re: Draft for basic NUMA observability

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Draft for basic NUMA observability
Date: 2025-03-10 10:14:07
Message-ID: Z867b09EIZncOsmV@ip-10-97-1-34.eu-west-3.compute.internal
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Fri, Mar 07, 2025 at 12:33:27PM +0100, Jakub Wartak wrote:
> On Fri, Mar 7, 2025 at 11:20 AM Jakub Wartak
> <jakub(dot)wartak(at)enterprisedb(dot)com> wrote:
> >
> > Hi,
> > On Wed, Mar 5, 2025 at 10:30 AM Jakub Wartak
> > <jakub(dot)wartak(at)enterprisedb(dot)com> wrote:
> > >Hi,
> >
> > > > Yeah, that's why I was mentioning to use a "shared" populate_buffercache_entry()
> > > > or such function: to put the "duplicated" code in it and then use this
> > > > shared function in pg_buffercache_pages() and in the new numa related one.
> > >
> > > OK, so hastily attempted that in 7b , I had to do a larger refactor
> > > there to avoid code duplication between those two. I don't know which
> > > attempt is better though (7 vs 7b)..
> > >
> >
> > I'm attaching basically the earlier stuff (v7b) as v8 with the
> > following minor changes:
> > - docs are included
> > - changed int8 to int4 in one function definition for numa_zone_id
>
> .. and v9 attached because cfbot partially complained about
> .cirrus.tasks.yml being adjusted recently (it seems master is hot
> these days).
>

Thanks for the new version!

Some random comments on 0001:

=== 1

It does not compiles "alone". It's missing:

+++ b/src/include/storage/pg_shmem.h
@@ -45,6 +45,7 @@ typedef struct PGShmemHeader /* standard header for all Postgres shmem */
extern PGDLLIMPORT int shared_memory_type;
extern PGDLLIMPORT int huge_pages;
extern PGDLLIMPORT int huge_page_size;
+extern PGDLLIMPORT int huge_pages_status;

and

--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -563,7 +563,7 @@ static int ssl_renegotiation_limit;
*/
int huge_pages = HUGE_PAGES_TRY;
int huge_page_size;
-static int huge_pages_status = HUGE_PAGES_UNKNOWN;
+int huge_pages_status = HUGE_PAGES_UNKNOWN;

That come with 0002. So it has to be in 0001 instead.

=== 2

+else
+ as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
+fi

Maybe we should add the same test (checking for numa.h) for meson?

=== 3

+# FIXME: filter-out / with/without with_libnuma?
+LIBS += $(LIBNUMA_LIBS)

It looks to me that we can remove those 2 lines.

=== 4

+ Only supported on Linux.

s/on Linux/on platforms for which the libnuma library is implemented/?

I did a quick grep on "Only supported on" and it looks like that could be
a more consistent wording.

=== 5

+#include "c.h"
+#include "postgres.h"
+#include "port/pg_numa.h"
+#include "storage/pg_shmem.h"
+#include <unistd.h>

I had a closer look to other header files and it looks like it "should" be:

#include "c.h"
#include "postgres.h"

#include <unistd.h>
#ifdef WIN32
#include <windows.h>
#endif

#include "port/pg_numa.h"
#include "storage/pg_shmem.h"

And is "#include "c.h"" really needed?

=== 6

+/* FIXME not tested, might crash */

That's a bit scary.

=== 7

+ * XXX: for now we issue just WARNING, but long-term that might depend on
+ * numa_set_strict() here

s/here/here./

Did not look carefully at all the comments in 0001, 0002 and 0003 though.

A few random comments regarding 0002:

=== 8

# create extension pg_buffercache;
ERROR: could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so": /home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so: undefined symbol: pg_numa_query_pages
CONTEXT: SQL statement "CREATE FUNCTION pg_buffercache_pages()
RETURNS SETOF RECORD
AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
LANGUAGE C PARALLEL SAFE"
extension script file "pg_buffercache--1.2.sql", near line 7

While that's ok if 0003 is applied. I think that each individual patch should
"fully" work individually.

=== 9

+ do
+ {
+ if (os_page_ptrs[blk2page + j] == 0)

blk2page + j will be repeated multiple times, better to store it in a local
variable instead?

=== 10

+ if (firstUseInBackend == true)

if (firstUseInBackend) instead?

=== 11

+ int j = 0,
+ blk2page = (int) i * pages_per_blk;

I wonder if size_t is more appropriate for blk2page:

size_t blk2page = (size_t)(i * pages_per_blk) maybe?

=== 12

as we know that we'll iterate until pages_per_blk then would a for loop be more
appropriate here, something like?

"
for (size_t j = 0; j < pages_per_blk; j++)
{
if (os_page_ptrs[blk2page + j] == 0)
{
"

=== 13

+ if (buf_state & BM_DIRTY)
+ fctx->record[i].isdirty = true;
+ else
+ fctx->record[i].isdirty = false;

fctx->record[i].isdirty = (buf_state & BM_DIRTY) != 0 maybe?

=== 14

+ if ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
+ fctx->record[i].isvalid = true;
+ else
+ fctx->record[i].isvalid = false;

fctx->record[i].isvalid = ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID))
maybe?

=== 15

+populate_buffercache_entry(int i, BufferCachePagesContext *fctx)

I wonder if "i" should get a more descriptive name?

=== 16

s/populate_buffercache_entry/pg_buffercache_build_tuple/? (to be consistent
with pg_stat_io_build_tuples for example).

I now realize that I did propose populate_buffercache_entry() up-thread,
sorry for changing my mind.

=== 17

+ static bool firstUseInBackend = true;

maybe we should give it a more descriptive name?

Also I need to think more about how firstUseInBackend is used, for example:

==== 17.1

would it be better to define it as a file scope variable? (at least that
would avoid to pass it as an extra parameter in some functions).

=== 17.2

what happens if firstUseInBackend is set to false and later on the pages
are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is
called again by a backend that already had set firstUseInBackend to false,
would that provide accurate information?

=== 18

+Datum
+pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
+{
+ FuncCallContext *funcctx;
+ BufferCachePagesContext *fctx; /* User function context. */
+ bool query_numa = true;

I don't think we need query_numa anymore in pg_buffercache_numa_pages().

I think that we can just ERROR (or NOTICE and return) here:

+ if (pg_numa_init() == -1)
+ {
+ elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");
+ query_numa = false;

and fully get rid of query_numa.

=== 19

And then here:

for (i = 0; i < NBuffers; i++)
{
populate_buffercache_entry(i, fctx);
if (query_numa)
pg_buffercache_numa_prepare_ptrs(i, pages_per_blk, os_page_size, os_page_ptrs, firstUseInBackend);
}

if (query_numa)
{
if (pg_numa_query_pages(0, os_page_count, os_page_ptrs, os_pages_status) == -1)
elog(ERROR, "failed NUMA pages inquiry: %m");

for (i = 0; i < NBuffers; i++)
{
int blk2page = (int) i * pages_per_blk;

/*
* Technically we can get errors too here and pass that to
* user. Also we could somehow report single DB block spanning
* more than one NUMA zone, but it should be rare.
*/
fctx->record[i].numa_zone_id = os_pages_status[blk2page];
}

maybe we can just loop a single time over "for (i = 0; i < NBuffers; i++)"?

A few random comments regarding 0003:

=== 20

+ <indexterm zone="view-pg-shmem-numa-allocations">
+ <primary>pg_shmem_numa_allocations</primary>
+ </indexterm>
+
+ <para>
+ The <structname>pg_shmem_allocations</structname> view shows NUMA nodes

s/pg_shmem_allocations/pg_shmem_numa_allocations/?

=== 21

+ /* FIXME: should we release LWlock here ? */
+ elog(ERROR, "failed NUMA pages inquiry status: %m");

There is no need, see src/backend/storage/lmgr/README.

=== 22

+#define MAX_NUMA_ZONES 32 /* FIXME? */
+ Size zones[MAX_NUMA_ZONES];

could we rely on pg_numa_get_max_node() instead?

=== 23

+ if (s >= 0)
+ zones[s]++;

should we also check that s is below a limit?

=== 24

Regarding how we make use of pg_numa_get_max_node(), are we sure there is
no possible holes? I mean could a system have node 0,1 and 3 but not 2?

Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I
did a little in 0001 though)

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Bykov Ivan 2025-03-10 10:17:45 RE: Query ID Calculation Fix for DISTINCT / ORDER BY and LIMIT / OFFSET
Previous Message Steven Niu 2025-03-10 10:07:34 Re: [Patch] remove duplicated smgrclose