Quick Links

Re: Draft for basic NUMA observability

From:	Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc:	Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Draft for basic NUMA observability
Date:	2025-03-04 10:48:31
Message-ID:	CAKZiRmwiLyLbLjoWsC7KrdFha9XcHyfG3BMmVQYRxvdJwgSq_A@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi!

On Thu, Feb 27, 2025 at 4:34 PM Bertrand Drouvot
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:

> I did some tests and it looks like it's giving correct results. I don't see -2
> anymore and every backend reports correct distribution (with or without hp,
> with "small" or "large" shared buffer).

Cool! Attached is v7 that is fully green on cirrus CI that Andres
recommended, we will see how cfbot reacts to this. BTW docs are still
missing. When started with proper interleave=all with s_b=64GB,
hugepages and 4 NUMA nodes (1socket with 4 CCDs) after small pgbench:

postgres=# select buffers_used, buffers_unused from pg_buffercache_summary();
buffers_used | buffers_unused
--------------+----------------
170853 | 8217755
(
ostgres=# select numa_zone_id, count(*) from pg_buffercache_numa group
by numa_zone_id order by numa_zone_id;
DEBUG: NUMA: os_page_count=32768 os_page_size=2097152 pages_per_blk=0.003906
numa_zone_id | count
--------------+---------
0 | 42752
1 | 42752
2 | 42752
3 | 42597
| 8217755
Time: 5828.100 ms (00:05.828)
postgres=# select 3*42752+42597;
?column?
----------
170853

postgres=# select * from pg_shmem_numa_allocations order by numa_size
desc limit 12;
DEBUG: NUMA: page-faulting shared memory segments for proper NUMA readouts
name | numa_zone_id | numa_size
--------------------+--------------+-------------
Buffer Blocks | 0 | 17179869184
Buffer Blocks | 1 | 17179869184
Buffer Blocks | 3 | 17179869184
Buffer Blocks | 2 | 17179869184
Buffer Descriptors | 2 | 134217728
Buffer Descriptors | 1 | 134217728
Buffer Descriptors | 0 | 134217728
Buffer Descriptors | 3 | 134217728
Checkpointer Data | 1 | 67108864
Checkpointer Data | 0 | 67108864
Checkpointer Data | 2 | 67108864
Checkpointer Data | 3 | 67108864

Time: 68.579 ms

> A few random comments:
>
> === 1
>
[..]
> I think that the if (query_numa) check should also wrap that entire section of code.

Done.

> === 2
>
> + if (query_numa)
> + {
> + blk2page = (int) i * pages_per_blk;
> + j = 0;
> + do
> + {
>
> This check is done for every page. I wonder if it would not make sense
> to create a brand new function for pg_buffercache_numa and just let the
> current pg_buffercache_pages() as it is. That said it would be great to avoid
> code duplication as much a possible though, maybe using a shared
> populate_buffercache_entry() or such helper function?

Well, I've made query_numa a parameter there simply to avoid that code
duplication in the first place, look at those TupleDescInitEntry()...
IMHO rarely anybody uses pg_buffercache, but we could add unlikely()
there maybe to hint compiler with some smaller routine to reduce
complexity of that main routine? (assuming NUMA inquiry is going to be
rare).

> === 3
>
> +#define ONE_GIGABYTE 1024*1024*1024
> + if ((i * os_page_size) % ONE_GIGABYTE == 0)
> + CHECK_FOR_INTERRUPTS();
> + }
>
> Did you observe noticable performance impact if calling CHECK_FOR_INTERRUPTS()
> for every page instead? (I don't see with a 30GB shared buffer). I've the
> feeling that we could get rid of the "ONE_GIGABYTE" check.

You are right and no it was simply my premature optimization attempt,
as apparently CFI on closer looks seems to be already having
unlikely() and looks really cheap, so yeah I've removed that.

> === 4
>
> + pfree(os_page_ptrs);
> + pfree(os_pages_status);
>
> Not sure that's needed, we should be in a short-lived memory context here
> (ExprContext or such).

Yes, I wanted to have it just for illustrative and stylish purposes,
but right, removed.

> === 5
>
> +/* SQL SRF showing NUMA zones for allocated shared memory */
> +Datum
> +pg_get_shmem_numa_allocations(PG_FUNCTION_ARGS)
> +{
[..]
>
> + for (i = 0; i < shm_ent_page_count; i++)
> + {
> + /*
> + * In order to get reliable results we also need to touch memory
> + * pages so that inquiry about NUMA zone doesn't return -2.
> + */
> + volatile uint64 touch pg_attribute_unused();
> +
> + page_ptrs[i] = (char *) ent->location + (i * os_page_size);
> + pg_numa_touch_mem_if_required(touch, page_ptrs[i]);
>
> That sounds right.
>
> Could we also avoid some code duplication with pg_get_shmem_allocations()?

Not sure I understand do you want to avoid code duplication
pg_get_shmem_allocations() vs pg_get_shmem_numa_allocations() or
pg_get_shmem_numa_allocations() vs pg_buffercache_pages(query_numa =
true) ?

>
> Also same remarks about pfree() and ONE_GIGABYTE as above.

Fixed.

> A few other things:
>
> ==== 6
>
> +#include "port/pg_numa.h"
> Not at the right position, should be between those 2:
>
> #include "miscadmin.h"
> #include "storage/lwlock.h"

Fixed.

> ==== 7
>
> +/*-------------------------------------------------------------------------
> + *
> + * pg_numa.h
> + * Miscellaneous functions for bit-wise operations.
>
> description is not correct. Also the "Copyright (c) 2019-2025" might be
> "Copyright (c) 2025" instead.

Fixed.

> === 8
>
> +++ b/src/port/pg_numa.c
> @@ -0,0 +1,150 @@
> +/*-------------------------------------------------------------------------
> + *
> + * numa.c
> + * Basic NUMA portability routines
>
> s/numa.c/pg_numa.c/ ?

Fixed.

> === 9
>
> +++ b/contrib/pg_buffercache/pg_buffercache_pages.c
> @@ -6,6 +6,7 @@
> * contrib/pg_buffercache/pg_buffercache_pages.c
> *-------------------------------------------------------------------------
> */
> +#include "pg_config.h"
> #include "postgres.h"
>
> Is this new include needed?

Removed, don't remember how it arrived here, most have been some
artifact of earlier attempts.

> #include "access/htup_details.h"
> @@ -13,10 +14,12 @@
> #include "funcapi.h"
> #include "storage/buf_internals.h"
> #include "storage/bufmgr.h"
> +#include "port/pg_numa.h"
> +#include "storage/pg_shmem.h"
>
> not in the right order.

Fixed.

And also those from nearby message:

On Thu, Feb 27, 2025 at 4:42 PM Bertrand Drouvot
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
> On Wed, Feb 26, 2025 at 09:38:20AM +0100, Jakub Wartak wrote:
> > On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > >
> > > Why is this done before we even have gotten -2 back? Even if we need it, it
> > > seems like we ought to defer this until necessary.
> >
> > Not fixed yet: maybe we could even do a `static` with
> > `has_this_run_earlier` and just perform this work only once during the
> > first time?
>
> Not sure I get your idea, could you share what the code would look like?

Please see pg_buffercache_pages I've just added static bool firstUseInBackend:

postgres(at)postgres:1234 : 25103 # select numa_zone_id, count(*) from
pg_buffercache_numa group by numa_zone_id;
DEBUG: NUMA: os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
DEBUG: NUMA: page-faulting the buffercache for proper NUMA readouts
[..]
postgres(at)postgres:1234 : 25103 # select numa_zone_id, count(*) from
pg_buffercache_numa group by numa_zone_id;
DEBUG: NUMA: os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
numa_zone_id | count
[..]

Same was done to the pg_get_shmem_numa_allocations.

Also CFbot/cirrus was getting:

> [11:01:05.027] In file included from ../../src/include/postgres.h:49,
> [11:01:05.027] from pg_buffercache_pages.c:10:
> [11:01:05.027] pg_buffercache_pages.c: In function ‘pg_buffercache_pages’:
> [11:01:05.027] pg_buffercache_pages.c:194:30: error: format ‘%ld’ expects argument of type ‘long int’, but argument 3 has type ‘Size’ {aka ‘long long unsigned int’} [-Werror=format=]

Fixed with %zu (for size_t) instead of %ld.

> Linux - Debian Bookworm - Autoconf got:
> [10:42:59.216] checking numa.h usability... no
> [10:42:59.268] checking numa.h presence... no
> [10:42:59.286] checking for numa.h... no
> [10:42:59.286] configure: error: header file <numa.h> is required for --with-libnuma

I've added libnuma1 to cirrus in a similar vein like libcurl to avoid this.

> [13:50:47.449] gcc -m32 @src/backend/postgres.rsp
> [13:50:47.449] /usr/bin/ld: /usr/lib/x86_64-linux-gnu/libnuma.so: error adding symbols: file in wrong format

I've also got an error in 32-bit build as libnuma.h is there, but
apparently libnuma provides only x86_64. Anyway 32-bit(even with PAE)
and NUMA doesn't seem to make sense so I've added -Dlibnuma=off for
such build.

-J.

Attachment	Content-Type	Size
v7-0001-Add-optional-dependency-to-libnuma-Linux-only-for.patch	application/octet-stream	14.4 KB
v7-0003-Add-pg_shmem_numa_allocations-to-show-NUMA-zones-.patch	application/octet-stream	11.1 KB
v7-0002-Extend-pg_buffercache-with-new-view-pg_buffercach.patch	application/octet-stream	16.8 KB

In response to

Re: Draft for basic NUMA observability at 2025-02-27 15:34:25 from Bertrand Drouvot

Responses

Re: Draft for basic NUMA observability at 2025-03-04 16:02:17 from Bertrand Drouvot

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Dave Cramer	2025-03-04 10:54:09	Re: [PATCH] Add native windows on arm64 support
Previous Message	Álvaro Herrera	2025-03-04 10:34:58	Re: Next commitfest app release is planned for March 18th