Re: Draft for basic NUMA observability

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Draft for basic NUMA observability
Date: 2025-03-18 10:19:32
Message-ID: CAKZiRmzGTMbCfEcqHXhQuqc-Pb7P4iFW95JNXXjnpPtr5g_6rw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Mar 17, 2025 at 5:11 PM Bertrand Drouvot
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:

> Thanks for v13!

Rebased and fixes inside in the attached v14 (it passes CI too):

> Looking at 0003:
>
> === 1
>
> + <entry>NUMA mappings for shared memory allocations</entry>
>
> s/NUMA mappings/NUMA node mappings/ maybe?

Done.

> === 2
>
> + <para>
> + The <structname>pg_shmem_numa_allocations</structname> view shows NUMA nodes
> + assigned allocations made from the server's main shared memory segment.
>
> What about?
>
> "
> shows how shared memory allocations in the server's main shared memory segment
> are distributed across NUMA nodes" ?

Done.

> === 3
>
> + <structfield>numa_zone_id</structfield> <type>int4</type>
>
> s/numa_zone_id/zone_id? to be consistent with pg_buffercache_numa introduced in
> 0002.
>
> BTW, I wonder if "node_id" would be better (to match the descriptions...).
> If so, would also need to be done in 0002.

Somewhat duplicate, please see answer for #9

> === 4
>
> + ID of NUMA node
>
> <acronym>NUMA</acronym> node ID? (to be consistent with 0002).
>
> === 5
>
> +static bool firstUseInBackend = true;
>
> Let's use firstNumaTouch to be consistent with 0002.

Done.

> === 6
>
> + elog(NOTICE, "libnuma initialization failed or NUMA is not supported on this platform, some NUMA data might be unavailable.");;
>
> There is 2 ";" + I think that we should used the same wording as in
> pg_buffercache_numa_pages().
>
> === 7
>
> What about using ERROR instead? (like in pg_buffercache_numa_pages())

Both are synced now.

> === 8
>
> + /*
> + * This is for gathering some NUMA statistics. We might be using various
> + * DB block sizes (4kB, 8kB , .. 32kB) that end up being allocated in
> + * various different OS memory pages sizes, so first we need to understand
> + * the OS memory page size before calling move_pages()
> + */
> + os_page_size = pg_numa_get_pagesize();
>
> Maybe use the same comment as the one in pg_buffercache_numa_pages() before calling
> pg_numa_get_pagesize()?

Done, improved style of the comment there and synced pg_buffercache
one to shmem.c one.

> === 9
>
> + max_zones = pg_numa_get_max_node();
>
> I think we are mixing "zone" and "node". I think we should standardize on one
> and use it everywhere (code and doc for both 0002 and 0003). I'm tempted to
> vote for node, but zone is fine too if you prefer.

Given that numa(7) does not use "zone" keyword at all and both
/proc/zoneinfo and /proc/pagetypeinfo shows that NUMA nodes are split
into zones, we can conclude that "zone" is simply a subdivision within
a NUMA node's memory (internal kernel thing). Examples are ZONE_DMA,
ZONE_NORMAL, ZONE_HIGHMEM. We are fetching just node id info (without
internal information about zones), therefore we should stay away from
using "zone" within the patch at all, as we are just fetching NUMA
node info. My bad, it's a terminology error on my side from start -
I've probably saw "zone" info in some command output back then when we
had that workshop and started using it and somehow it propagated
through the patchset up to this day... I've adjusted it all and
settled on "numa_node_id" column name.

> === 10
>
> + /*
> + * Preallocate memory all at once without going into details which shared
> + * memory segment is the biggest (technically min s_b can be as low as
> + * 16xBLCKSZ)
> + */
>
> What about?
>
> "
> Allocate memory for page pointers and status based on total shared memory size.
> This simplified approach allocates enough space for all pages in shared memory
> rather than calculating the exact requirements for each segment.
> " instead?

Done.

> === 11
>
> + int shm_total_page_count,
> + shm_ent_page_count,
>
> I think those 2 should be uint64.

Right...

> === 12
>
> + /*
> + * XXX: We are ignoring in NUMA version reporting of the following regions
> + * (compare to pg_get_shmem_allocations() case): 1. output shared memory
> + * allocated but not counted via the shmem index 2. output as-of-yet
> + * unused shared memory
> + */
>
> why XXX?
>
> what about?
>
> "
> We are ignoring the following memory regions (as compared to
> pg_get_shmem_allocations())....

Fixed , it was apparently leftover of when I was thinking if we should
still report it.

> === 13
>
> + page_ptrs = palloc(sizeof(void *) * shm_total_page_count);
> + memset(page_ptrs, 0, sizeof(void *) * shm_total_page_count);
>
> maybe we could use palloc0() here?

Of course!++

> === 14
>
> and I realize that we could probably use it in 0002 for os_page_ptrs.

Of course!++

> === 15
>
> I think there is still some multi-lines comments that are missing a period. I
> probably also missed some in 0002 during the previous review. I think that's
> worth another check.

Please do such a check, I've tried pgident on all .c files, but I'm
apparently blind to such issues. BTW if patch has anything left that
causes pgident to fix, that is not picked by CI but it is picked by
buildfarm??

> === 16
>
> + * In order to get reliable results we also need to touch memory
> + * pages so that inquiry about NUMA zone doesn't return -2.
> + */
>
> maybe use the same wording as in 0002?

But 0002 used:

"In order to get reliable results we also need to touch memory pages, so that
inquiry about NUMA zone doesn't return -2 (which indicates
unmapped/unallocated
pages)"

or are you looking at something different?

> === 17
>
> The logic in 0003 looks ok to me. I don't like the 2 loops on shm_ent_page_count
> but (as for 0002) it looks like we can not avoid it (or at least I don't see
> a way to avoid it).

Hm, it's literally debug code. Why would we care so much if it is 2
loops rather than 1? (as stated earlier we need to pack ptrs and then
analyze it)

> I'll still review the whole set of patches 0001, 0002 and 0003 once 0003 is
> updated.

Cool, thanks in advance.

-J.

Attachment Content-Type Size
v14-0001-Add-optional-dependency-to-libnuma-Linux-only-fo.patch application/octet-stream 18.8 KB
v14-0003-Add-pg_shmem_numa_allocations-to-show-NUMA-memor.patch application/octet-stream 16.1 KB
v14-0002-Extend-pg_buffercache-with-new-view-pg_buffercac.patch application/octet-stream 28.6 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Shubham Khanna 2025-03-18 10:31:36 Re: Adding a '--clean-publisher-objects' option to 'pg_createsubscriber' utility.
Previous Message Amit Kapila 2025-03-18 09:55:39 Re: long-standing data loss bug in initial sync of logical replication