From: | Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com> |
---|---|
To: | Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com> |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Draft for basic NUMA observability |
Date: | 2025-03-18 14:29:44 |
Message-ID: | Z9mDWEj+kQzx6nmk@ip-10-97-1-34.eu-west-3.compute.internal |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On Tue, Mar 18, 2025 at 11:19:32AM +0100, Jakub Wartak wrote:
> On Mon, Mar 17, 2025 at 5:11 PM Bertrand Drouvot
> <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
> > Thanks for v13!
>
> Rebased and fixes inside in the attached v14 (it passes CI too):
Thanks!
> > === 9
> >
> > + max_zones = pg_numa_get_max_node();
> >
> > I think we are mixing "zone" and "node". I think we should standardize on one
> > and use it everywhere (code and doc for both 0002 and 0003). I'm tempted to
> > vote for node, but zone is fine too if you prefer.
>
> Given that numa(7) does not use "zone" keyword at all and both
> /proc/zoneinfo and /proc/pagetypeinfo shows that NUMA nodes are split
> into zones, we can conclude that "zone" is simply a subdivision within
> a NUMA node's memory (internal kernel thing). Examples are ZONE_DMA,
> ZONE_NORMAL, ZONE_HIGHMEM. We are fetching just node id info (without
> internal information about zones), therefore we should stay away from
> using "zone" within the patch at all, as we are just fetching NUMA
> node info. My bad, it's a terminology error on my side from start -
> I've probably saw "zone" info in some command output back then when we
> had that workshop and started using it and somehow it propagated
> through the patchset up to this day...
Thanks for the explanation.
> I've adjusted it all and settled on "numa_node_id" column name.
Yeah, I can see, things like:
+ <para>
+ The definitions of the columns exposed are identical to the
+ <structname>pg_buffercache</structname> view, except that this one includes
+ one additional <structfield>numa_node_id</structfield> column as defined in
+ <xref linkend="pgbuffercache-numa-columns"/>.
and like:
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>numa_size</structfield> <type>int4</type>
+ </para>
I think that you re-introduced the "numa_" in the column(s) name that we get
rid (or agreed to) of previously.
I think that we can get rid of the "numa_" stuff in column(s) name as
the column(s) are part of "numa" relation views/output anyway.
I think "node_id", "size" as column(s) name should be enough.
Or maybe that re-adding "numa_" was intentional?
> > === 15
> >
> > I think there is still some multi-lines comments that are missing a period. I
> > probably also missed some in 0002 during the previous review. I think that's
> > worth another check.
>
> Please do such a check,
Found much more:
+ * In order to get reliable results we also need to touch memory pages, so that
+ * inquiry about NUMA memory node doesn't return -2 (which indicates
+ * unmapped/unallocated pages)
is missing the period
and
+ /*
+ * Switch context when allocating stuff to be used in later calls
+ */
should be as before, meaning on current HEAD:
/* Switch context when allocating stuff to be used in later calls */
and
+ /*
+ * Return to original context when allocating transient memory
+ */
should be as before, meaning on current HEAD:
/* Return to original context when allocating transient memory */
and
+ /*
+ * Note if the buffer is valid, and has storage created
+ */
should be as before, meaning on current HEAD:
/* Note if the buffer is valid, and has storage created */
and
+ /*
+ * unused for v1.0 callers, but the array is always long enough
+ */
should be as before, meaning on current HEAD:
/* unused for v1.0 callers, but the array is always long enough */
and
+/*
+ * This is almost identical to the above, but performs
+ * NUMA inuqiry about memory mappings
is missing the period
and
+ * To correctly map between them, we need to: - Determine the OS
+ * memory page size - Calculate how many OS pages are used by all
+ * buffer blocks - Calculate how many OS pages are contained within
+ * each database block
is missing the period (2 times as this comment appears in 0002 and 0003)
and
+ /*
+ * If we ever get 0xff back from kernel inquiry, then we probably have
+ * bug in our buffers to OS page mapping code here
is missing the period (2 times as this comment appears in 0002 and 0003)
and
+ /*
+ * We are ignoring the following memory regions (as compared to
+ * pg_get_shmem_allocations()): 1. output shared memory allocated but not
+ * counted via the shmem index 2. output as-of-yet unused shared memory
is missing the period
> I've tried pgident on all .c files, but I'm
> apparently blind to such issues.
I don't think pgident would report missing period.
> BTW if patch has anything left that
> causes pgident to fix, that is not picked by CI but it is picked by
> buildfarm??
I think it has to be done manually before each commit and that this is anyway
done at least once per release cycle.
> > === 16
> >
> > + * In order to get reliable results we also need to touch memory
> > + * pages so that inquiry about NUMA zone doesn't return -2.
> > + */
> >
> > maybe use the same wording as in 0002?
>
> But 0002 used:
>
> "In order to get reliable results we also need to touch memory pages, so that
> inquiry about NUMA zone doesn't return -2 (which indicates
> unmapped/unallocated
> pages)"
>
> or are you looking at something different?
Nope, I meant to say that it could make sense to have the exact same comment.
> > === 17
> >
> > The logic in 0003 looks ok to me. I don't like the 2 loops on shm_ent_page_count
> > but (as for 0002) it looks like we can not avoid it (or at least I don't see
> > a way to avoid it).
>
> Hm, it's literally debug code. Why would we care so much if it is 2
> loops rather than 1? (as stated earlier we need to pack ptrs and then
> analyze it)
Yeah, but if we could just loop one time I'm pretty sure we'd have done that.
> > I'll still review the whole set of patches 0001, 0002 and 0003 once 0003 is
> > updated.
> :w
> Cool, thanks in advance.
0001 looks in a good shape from my point of view.
For 0002:
=== 1
I wonder if pg_buffercache_init_entries() and pg_buffercache_build_tuple() could
deserve their own patch. That would ease the review for the "real" numa stuff.
Maybe something like:
0001 as it is
0002 introduces (and uses) pg_buffercache_init_entries() and
pg_buffercache_build_tuple()
0003 current 0002 attached minus 0002 above
We did it that way in c2a50ac678e and ff7c40d7fd6 for example.
Regards,
--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
From | Date | Subject | |
---|---|---|---|
Next Message | Andrew Dunstan | 2025-03-18 14:31:02 | Re: Add -k/--link option to pg_combinebackup |
Previous Message | Andres Freund | 2025-03-18 14:12:51 | Re: optimize file transfer in pg_upgrade |