Quick Links

Re: Draft for basic NUMA observability

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Tomas Vondra <tomas(at)vondra(dot)me>
Cc:	Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Draft for basic NUMA observability
Date:	2025-04-07 16:27:46
Message-ID:	q4f4v2xnf7anfl5ucjm5dbqtnaspsq5z2ajndag35ed75avt5e@eqpqm4a5ap7q
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On 2025-04-06 13:51:34 +0200, Tomas Vondra wrote:
> On 4/6/25 00:29, Andres Freund wrote:
> >> +
> >> + if (firstNumaTouch)
> >> + elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
> >
> > Over the patchseries the related code is duplicated. Seems like it'd be good
> > to put it into pg_numa instead? This seems like the thing that's good to
> > abstract away in one central spot.
> >
>
> Abstract away which part, exactly? I thought about moving some of the
> code to port/pg_numa.c, but it didn't seem worth it.

The easiest would be to just have a single function that does this for the
whole shared memory allocation, without having to integrate it with the
per-allocation or per-buffer loop.

> >
> >> + /*
> >> + * If we have multiple OS pages per buffer, fill those in too. We
> >> + * always want at least one OS page, even if there are multiple
> >> + * buffers per page.
> >> + *
> >> + * Altough we could query just once per each OS page, we do it
> >> + * repeatably for each Buffer and hit the same address as
> >> + * move_pages(2) requires page aligment. This also simplifies
> >> + * retrieval code later on. Also NBuffers starts from 1.
> >> + */
> >> + for (j = 0; j < Max(1, pages_per_buffer); j++)
> >> + {
> >> + char *buffptr = (char *) BufferGetBlock(i + 1);
> >> +
> >> + fctx->record[idx].bufferid = bufferid;
> >> + fctx->record[idx].numa_page = j;
> >> +
> >> + os_page_ptrs[idx]
> >> + = (char *) TYPEALIGN(os_page_size,
> >> + buffptr + (os_page_size * j));
> >
> > FWIW, this bit here is the most expensive part of the function itself, as the
> > compiler has no choice than to implement it as an actual division, as
> > os_page_size is runtime variable.
> >
> > It'd be fine to leave it like that, the call to numa_move_pages() is way more
> > expensive. But it shouldn't be too hard to do that alignment once, rather than
> > having to do it over and over.
> >
>
> Division? It's entirely possible I'm missing something obvious, but I
> don't see any divisions in this code.

Oops. The division was only added in a subsequent patch, not the quoted
code. At the time it was:

+ /* calculate ID of the OS memory page */
+ fctx->record[idx].numa_page
+ = ((char *) os_page_ptrs[idx] - startptr) / os_page_size;

Greetings,

Andres Freund

In response to

Re: Draft for basic NUMA observability at 2025-04-06 11:51:34 from Tomas Vondra

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Rahila Syed	2025-04-07 16:27:57	Re: Enhancing Memory Context Statistics Reporting
Previous Message	Robert Haas	2025-04-07 16:23:06	Re: Logging which local address was connected to in log_line_prefix