Re: Draft for basic NUMA observability

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Draft for basic NUMA observability
Date: 2025-03-18 14:29:44
Message-ID: Z9mDWEj+kQzx6nmk@ip-10-97-1-34.eu-west-3.compute.internal
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Tue, Mar 18, 2025 at 11:19:32AM +0100, Jakub Wartak wrote:
> On Mon, Mar 17, 2025 at 5:11 PM Bertrand Drouvot
> <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
> > Thanks for v13!
>
> Rebased and fixes inside in the attached v14 (it passes CI too):

Thanks!

> > === 9
> >
> > + max_zones = pg_numa_get_max_node();
> >
> > I think we are mixing "zone" and "node". I think we should standardize on one
> > and use it everywhere (code and doc for both 0002 and 0003). I'm tempted to
> > vote for node, but zone is fine too if you prefer.
>
> Given that numa(7) does not use "zone" keyword at all and both
> /proc/zoneinfo and /proc/pagetypeinfo shows that NUMA nodes are split
> into zones, we can conclude that "zone" is simply a subdivision within
> a NUMA node's memory (internal kernel thing). Examples are ZONE_DMA,
> ZONE_NORMAL, ZONE_HIGHMEM. We are fetching just node id info (without
> internal information about zones), therefore we should stay away from
> using "zone" within the patch at all, as we are just fetching NUMA
> node info. My bad, it's a terminology error on my side from start -
> I've probably saw "zone" info in some command output back then when we
> had that workshop and started using it and somehow it propagated
> through the patchset up to this day...

Thanks for the explanation.

> I've adjusted it all and settled on "numa_node_id" column name.

Yeah, I can see, things like:

+ <para>
+ The definitions of the columns exposed are identical to the
+ <structname>pg_buffercache</structname> view, except that this one includes
+ one additional <structfield>numa_node_id</structfield> column as defined in
+ <xref linkend="pgbuffercache-numa-columns"/>.

and like:

+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_size</structfield> <type>int4</type>
+ </para>

I think that you re-introduced the "numa_" in the column(s) name that we get
rid (or agreed to) of previously.

I think that we can get rid of the "numa_" stuff in column(s) name as
the column(s) are part of "numa" relation views/output anyway.

I think "node_id", "size" as column(s) name should be enough.

Or maybe that re-adding "numa_" was intentional?

> > === 15
> >
> > I think there is still some multi-lines comments that are missing a period. I
> > probably also missed some in 0002 during the previous review. I think that's
> > worth another check.
>
> Please do such a check,

Found much more:

+ * In order to get reliable results we also need to touch memory pages, so that
+ * inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages)

is missing the period

and

+ /*
+ * Switch context when allocating stuff to be used in later calls
+ */

should be as before, meaning on current HEAD:

/* Switch context when allocating stuff to be used in later calls */

and

+ /*
+ * Return to original context when allocating transient memory
+ */

should be as before, meaning on current HEAD:

/* Return to original context when allocating transient memory */

and

+ /*
+ * Note if the buffer is valid, and has storage created
+ */

should be as before, meaning on current HEAD:

/* Note if the buffer is valid, and has storage created */

and

+ /*
+ * unused for v1.0 callers, but the array is always long enough
+ */

should be as before, meaning on current HEAD:

/* unused for v1.0 callers, but the array is always long enough */

and

+/*
+ * This is almost identical to the above, but performs
+ * NUMA inuqiry about memory mappings

is missing the period

and

+ * To correctly map between them, we need to: - Determine the OS
+ * memory page size - Calculate how many OS pages are used by all
+ * buffer blocks - Calculate how many OS pages are contained within
+ * each database block

is missing the period (2 times as this comment appears in 0002 and 0003)

and

+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here

is missing the period (2 times as this comment appears in 0002 and 0003)

and

+ /*
+ * We are ignoring the following memory regions (as compared to
+ * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+ * counted via the shmem index 2. output as-of-yet unused shared memory

is missing the period

> I've tried pgident on all .c files, but I'm
> apparently blind to such issues.

I don't think pgident would report missing period.

> BTW if patch has anything left that
> causes pgident to fix, that is not picked by CI but it is picked by
> buildfarm??

I think it has to be done manually before each commit and that this is anyway
done at least once per release cycle.

> > === 16
> >
> > + * In order to get reliable results we also need to touch memory
> > + * pages so that inquiry about NUMA zone doesn't return -2.
> > + */
> >
> > maybe use the same wording as in 0002?
>
> But 0002 used:
>
> "In order to get reliable results we also need to touch memory pages, so that
> inquiry about NUMA zone doesn't return -2 (which indicates
> unmapped/unallocated
> pages)"
>
> or are you looking at something different?

Nope, I meant to say that it could make sense to have the exact same comment.

> > === 17
> >
> > The logic in 0003 looks ok to me. I don't like the 2 loops on shm_ent_page_count
> > but (as for 0002) it looks like we can not avoid it (or at least I don't see
> > a way to avoid it).
>
> Hm, it's literally debug code. Why would we care so much if it is 2
> loops rather than 1? (as stated earlier we need to pack ptrs and then
> analyze it)

Yeah, but if we could just loop one time I'm pretty sure we'd have done that.

> > I'll still review the whole set of patches 0001, 0002 and 0003 once 0003 is
> > updated.
> :w

> Cool, thanks in advance.

0001 looks in a good shape from my point of view.

For 0002:

=== 1

I wonder if pg_buffercache_init_entries() and pg_buffercache_build_tuple() could
deserve their own patch. That would ease the review for the "real" numa stuff.

Maybe something like:

0001 as it is
0002 introduces (and uses) pg_buffercache_init_entries() and
pg_buffercache_build_tuple()
0003 current 0002 attached minus 0002 above

We did it that way in c2a50ac678e and ff7c40d7fd6 for example.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2025-03-18 14:31:02 Re: Add -k/--link option to pg_combinebackup
Previous Message Andres Freund 2025-03-18 14:12:51 Re: optimize file transfer in pg_upgrade