Re: Draft for basic NUMA observability

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Draft for basic NUMA observability
Date: 2025-04-06 12:57:05
Message-ID: CAKZiRmyE-Ei150RbHAubRcv+espPMoPN49d9thFxBbnWZrsMZA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Sun, Apr 6, 2025 at 12:29 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,

Hi Andres/Tomas,

I've noticed that Tomas responded to this while writing this, so I'm
attaching git-am patches based on his v25 (no squash) and there's only
one new (last one contains fixes based on this review) + slight commit
amendment to 0001.

> I just played around with this for a bit. As noted somewhere further down,
> pg_buffercache_numa.page_num ends up wonky in different ways for the different
> pages.

I think page_num is under heavy work in progress... I'm still not sure
is it worth exposing this (is it worth the hassle). If we scratch it
it won't be perfect, but we have everything , otherwise we risk this
feature as we are going into a feature freeze literally tomorrow.

> I think one thing that the docs should mention is that calling the numa
> functions/views will force the pages to be allocated, even if they're
> currently unused.
[..]
> I don't see how we can avoid that right now, but at the very least we ought to
> document it.

Added statement about this.

> On 2025-04-05 16:33:28 +0200, Tomas Vondra wrote:
> > The libnuma library is not available on 32-bit builds (there's no shared
> > object for i386), so we disable it in that case. The i386 is very memory
> > limited anyway, even with PAE, so NUMA is mostly irrelevant.
>
> Hm? libnuma1:i386 installs just fine to me on debian and contains the shared
> library.

OK, removed from the commit message, as yeah google states it really
exists (somehow I couldn't find it back then at least here)...

> > +###############################################################
> > +# Library: libnuma
> > +###############################################################
> > +
> > +libnumaopt = get_option('libnuma')
> > +if not libnumaopt.disabled()
> > + # via pkg-config
> > + libnuma = dependency('numa', required: libnumaopt)
> > + if not libnuma.found()
> > + libnuma = cc.find_library('numa', required: libnumaopt)
> > + endif
>
> This fallback isn't going to work if -dlibnuma=enabled is used, as
> dependency() will error out, due to not finding a required dependency. You'd
> need to use required: false there.
>
> Do we actually need a non-dependency() fallback here? It's linux only and a
> new dependency, so just requiring a .pc file seems like it'd be fine?

I'm not sure pkg-config is present everywhere, but I'm not expernt and
AFAIR we are not consistent how various libs are handled there. It's
quite late, but for now i've now just follwed Your's recommendation
for dependency() with false.

> > +#ifdef USE_LIBNUMA
> > +
> > +/*
> > + * This is required on Linux, before pg_numa_query_pages() as we
> > + * need to page-fault before move_pages(2) syscall returns valid results.
> > + */
> > +#define pg_numa_touch_mem_if_required(ro_volatile_var, ptr) \
> > + ro_volatile_var = *(uint64 *) ptr
>
> Does it really work that way? A volatile variable to assign the result of
> dereferencing ptr ensures that the *store* isn't removed by the compiler, but
> it doesn't at all guarantee that the *load* isn't removed, since that memory
> isn't marked as volatile.
>
> I think you'd need to cast the source pointer to a volatile uint64* to ensure
> the load happens.

OK, thanks, good finding, I was not aware compiler can bypass it like that.

[..]
> > +CREATE OR REPLACE VIEW pg_buffercache_numa AS
>
> Why CREATE OR REPLACE?

Fixed.

> > +Datum
> > +pg_buffercache_numa_pages(PG_FUNCTION_ARGS)
> > +{
> > ...
> > +
> > + /*
> > + * To smoothly support upgrades from version 1.0 of this extension
> > + * transparently handle the (non-)existence of the pinning_backends
> > + * column. We unfortunately have to get the result type for that... -
> > + * we can't use the result type determined by the function definition
> > + * without potentially crashing when somebody uses the old (or even
> > + * wrong) function definition though.
> > + */
> > + if (get_call_result_type(fcinfo, NULL, &expected_tupledesc) != TYPEFUNC_COMPOSITE)
> > + elog(ERROR, "return type must be a row type");
> > +
>
> Isn't that comment inapplicable for pg_buffercache_numa_pages(), a new view?

Removed.

> > +
> > + if (firstNumaTouch)
> > + elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
>
> Over the patchseries the related code is duplicated. Seems like it'd be good
> to put it into pg_numa instead? This seems like the thing that's good to
> abstract away in one central spot.

I hear you, but we are using those statistics for per-localized-code
(shm.c's firstTouch != pgbuffercache.c's firstTouch).

> > + /*
> > + * Scan through all the buffers, saving the relevant fields in the
> > + * fctx->record structure.
> > + *
> > + * We don't hold the partition locks, so we don't get a consistent
> > + * snapshot across all buffers, but we do grab the buffer header
> > + * locks, so the information of each buffer is self-consistent.
> > + *
> > + * This loop touches and stores addresses into os_page_ptrs[] as input
> > + * to one big big move_pages(2) inquiry system call. Basically we ask
> > + * for all memory pages for NBuffers.
> > + */
> > + idx = 0;
> > + for (i = 0; i < NBuffers; i++)
> > + {
> > + BufferDesc *bufHdr;
> > + uint32 buf_state;
> > + uint32 bufferid;
> > +
> > + CHECK_FOR_INTERRUPTS();
> > +
> > + bufHdr = GetBufferDescriptor(i);
> > +
> > + /* Lock each buffer header before inspecting. */
> > + buf_state = LockBufHdr(bufHdr);
> > + bufferid = BufferDescriptorGetBuffer(bufHdr);
> > +
> > + UnlockBufHdr(bufHdr, buf_state);
>
> Given that the only thing you're getting here is the buffer id, it's probably
> not worth getting the spinlock, a single 4 byte read is always atomic on our
> platforms.

Well, I think this is just copy&pasted from original function, so we
follow the pattern for consistency.

> > + /*
> > + * If we have multiple OS pages per buffer, fill those in too. We
> > + * always want at least one OS page, even if there are multiple
> > + * buffers per page.
> > + *
> > + * Altough we could query just once per each OS page, we do it
> > + * repeatably for each Buffer and hit the same address as
> > + * move_pages(2) requires page aligment. This also simplifies
> > + * retrieval code later on. Also NBuffers starts from 1.
> > + */
> > + for (j = 0; j < Max(1, pages_per_buffer); j++)
> > + {
> > + char *buffptr = (char *) BufferGetBlock(i + 1);
> > +
> > + fctx->record[idx].bufferid = bufferid;
> > + fctx->record[idx].numa_page = j;
> > +
> > + os_page_ptrs[idx]
> > + = (char *) TYPEALIGN(os_page_size,
> > + buffptr + (os_page_size * j));
>
> FWIW, this bit here is the most expensive part of the function itself, as the
> compiler has no choice than to implement it as an actual division, as
> os_page_size is runtime variable.
>
> It'd be fine to leave it like that, the call to numa_move_pages() is way more
> expensive. But it shouldn't be too hard to do that alignment once, rather than
> having to do it over and over.

TBH, I don't think we should spend lot of time optimizing, after all
it's debugging view (at the start I was actually considering putting
it as developer-only compile time option, but with shm view it is
actually usuable for others too and well... we want to have it as
foundation for real NUMA optimizations)

> FWIW, neither this definition of numa_page, nor the one from "adjust page_num"
> works quite right for me.
>
> This definition afaict is always 0 when using huge pages and just 0 and 1 for
> 4k pages. But my understanding of numa_page is that it's the "id" of the numa
> pages, which isn't that?
>
> With "adjust page_num" I get a number that starts at -1 and then increments
> from there. More correct, but doesn't quite seem right either.

Apparently handling this special case of splitted buffers edge-cases
was Pandora box ;) Tomas what do we do about it? Does that has chance
to get in before freeze?

> > + <tbody>
> > + <row>
> > + <entry role="catalog_table_entry"><para role="column_definition">
> > + <structfield>bufferid</structfield> <type>integer</type>
> > + </para>
> > + <para>
> > + ID, in the range 1..<varname>shared_buffers</varname>
> > + </para></entry>
> > + </row>
> > +
> > + <row>
> > + <entry role="catalog_table_entry"><para role="column_definition">
> > + <structfield>page_num</structfield> <type>int</type>
> > + </para>
> > + <para>
> > + number of OS memory page for this buffer
> > + </para></entry>
> > + </row>
>
> "page_num" doesn't really seem very descriptive for "number of OS memory page
> for this buffer". To me "number of" sounds like it's counting the number of
> associated pages, but it's really just a "page id" or something like that.
>
> Maybe rename page_num to "os_page_id" and rephrase the comment a bit?

Tomas, are you good with rename? I think I've would also prefer os_page_id.

> > diff --git a/doc/src/sgml/system-views.sgml b/doc/src/sgml/system-views.sgml
[..]
> > + <para>
> > + The <structname>pg_shmem_allocations_numa</structname> shows how shared
> > + memory allocations in the server's main shared memory segment are distributed
> > + across NUMA nodes. This includes both memory allocated by
> > + <productname>PostgreSQL</productname> itself and memory allocated
> > + by extensions using the mechanisms detailed in
> > + <xref linkend="xfunc-shared-addin" />.
> > + </para>
>
> I think it'd be good to describe that the view will include multiple rows for
> each name if spread across multiple numa nodes.

Added.

> Perhaps also that querying this view is expensive and that
> pg_shmem_allocations should be used if numa information isn't needed?

Already covered by 1st finding fix.

> > + /*
> > + * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
> > + * the OS may have different memory page sizes.
> > + *
> > + * To correctly map between them, we need to: 1. Determine the OS memory
> > + * page size 2. Calculate how many OS pages are used by all buffer blocks
> > + * 3. Calculate how many OS pages are contained within each database
> > + * block.
> > + *
> > + * This information is needed before calling move_pages() for NUMA memory
> > + * node inquiry.
> > + */
> > + os_page_size = pg_numa_get_pagesize();
> > +
> > + /*
> > + * Allocate memory for page pointers and status based on total shared
> > + * memory size. This simplified approach allocates enough space for all
> > + * pages in shared memory rather than calculating the exact requirements
> > + * for each segment.
> > + *
> > + * XXX Isn't this wasteful? But there probably is one large segment of
> > + * shared memory, much larger than the rest anyway.
> > + */
> > + shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
> > + page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
> > + pages_status = palloc(sizeof(int) * shm_total_page_count);
>
> There's a fair bit of duplicated code here with pg_buffercache_numa_pages(),
> could more be moved to a shared helper function?

-> Tomas?

> > + hash_seq_init(&hstat, ShmemIndex);
> > +
> > + /* output all allocated entries */
> > + memset(nulls, 0, sizeof(nulls));
> > + while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
> > + {
>
> One thing that's interesting with using ShmemIndex is that this won't get
> anonymous allocations. I think that's ok for now, it'd be complicated to
> figure out the unmapped "regions". But I guess it' be good to mention it in
> the ocs?

Added.

> > + int i;
> > +
> > + /* XXX I assume we use TYPEALIGN as a way to round to whole pages.
> > + * It's a bit misleading to call that "aligned", no? */
> > +
> > + /* Get number of OS aligned pages */
> > + shm_ent_page_count
> > + = TYPEALIGN(os_page_size, ent->allocated_size) / os_page_size;
> > +
> > + /*
> > + * If we get ever 0xff back from kernel inquiry, then we probably have
> > + * bug in our buffers to OS page mapping code here.
> > + */
> > + memset(pages_status, 0xff, sizeof(int) * shm_ent_page_count);
>
> There's obviously no guarantee that shm_ent_page_count is a multiple of
> os_page_size. I think it'd be interesting to show in the view when one shmem
> allocation shares a page with the prior allocation - that can contribute a bit
> to contention. What about showing a start_os_page_id and end_os_page_id or
> something? That could be a feature for later though.

I was thinking about it, but it could be done when analyzing this
together with data from pg_shmem_allocations(?) My worry is timing :(
Anyway, we could extend this view in future revisions.

> > +SELECT NOT(pg_numa_available()) AS skip_test \gset
> > +\if :skip_test
> > +\quit
> > +\endif
> > +-- switch to superuser
> > +\c -
> > +SELECT COUNT(*) >= 0 AS ok FROM pg_shmem_allocations_numa;
> > + ok
> > +----
> > + t
> > +(1 row)
>
> Could it be worthwhile to run the test if !pg_numa_available(), to test that
> we do the right thing in that case? We need an alternative output anyway, so
> that might be fine?

Added. the meson test passes, but I'm sending it as fast as possible
to avoid a clash with Tomas.

-J.

Attachment Content-Type Size
v25-0006-fixes-for-review-by-Andres.patch application/octet-stream 5.3 KB
v25-0002-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch application/octet-stream 21.0 KB
v25-0004-Introduce-pg_shmem_allocations_numa-view.patch application/octet-stream 17.0 KB
v25-0005-adjust-page-alignment.patch application/octet-stream 1.9 KB
v25-0001-Add-support-for-basic-NUMA-awareness.patch application/octet-stream 21.9 KB
v25-0003-adjust-page_num.patch application/octet-stream 1.8 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tender Wang 2025-04-06 13:09:22 Re: Removing unneeded self joins
Previous Message Tomas Vondra 2025-04-06 11:56:54 Re: Draft for basic NUMA observability