Quick Links

Re: Draft for basic NUMA observability

From:	Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To:	Tomas Vondra <tomas(at)vondra(dot)me>
Cc:	Andres Freund <andres(at)anarazel(dot)de>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Draft for basic NUMA observability
Date:	2025-04-07 08:09:26
Message-ID:	CAKZiRmxQBcMM0e98e79OpK7_XxoLBmXfMYys-gC-iED9N+wMrQ@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Sun, Apr 6, 2025 at 3:52 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> On 4/6/25 14:57, Jakub Wartak wrote:
> > On Sun, Apr 6, 2025 at 12:29 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> >>
> >> Hi,
> >
> > Hi Andres/Tomas,
> >
> > I've noticed that Tomas responded to this while writing this, so I'm
> > attaching git-am patches based on his v25 (no squash) and there's only
> > one new (last one contains fixes based on this review) + slight commit
> > amendment to 0001.
> >
>
> I'm not working on this at the moment. I may have a bit of time in the
> evening, but more likely I'll get back to this on Monday.

OK, tried to fix all outstanding issues (except maybe some tiny code
refactors for beatification)

> >> I just played around with this for a bit. As noted somewhere further down,
> >> pg_buffercache_numa.page_num ends up wonky in different ways for the different
> >> pages.
> >
> > I think page_num is under heavy work in progress... I'm still not sure
> > is it worth exposing this (is it worth the hassle). If we scratch it
> > it won't be perfect, but we have everything , otherwise we risk this
> > feature as we are going into a feature freeze literally tomorrow.
> >
>
> IMHO it's not difficult to change the definition of page_num this way,
> it's pretty much a one line change. It's more a question of whether we
> actually want to expose this.

Bertrand noticed this first in
https://www.postgresql.org/message-id/Z/FhOOCmTxuB2h0b%40ip-10-97-1-34.eu-west-3.compute.internal
:

- startptr = (char *) BufferGetBlock(1);
+ startptr = (char *) TYPEALIGN_DOWN(os_page_size, (char
*) BufferGetBlock(1));

With the above I'm also not getting wonky (-1) results anymore. The
rest of reply assumes we are using this.

> [snip]
>
> >>> +
> >>> + if (firstNumaTouch)
> >>> + elog(DEBUG1, "NUMA: page-faulting the buffercache for proper NUMA readouts");
> >>
> >> Over the patchseries the related code is duplicated. Seems like it'd be good
> >> to put it into pg_numa instead? This seems like the thing that's good to
> >> abstract away in one central spot.
> >
> > I hear you, but we are using those statistics for per-localized-code
> > (shm.c's firstTouch != pgbuffercache.c's firstTouch).
> >
>
> Yeah, I don't moving this is quite possible / useful. We could pass the
> flag somewhere, but we still need to check & update it in local code.

No idea how such code should be looking, but it would be more code?
Are you thinking about some code still touching local &firstNumaTouch
?

> >> FWIW, neither this definition of numa_page, nor the one from "adjust page_num"
> >> works quite right for me.
> >>
> >> This definition afaict is always 0 when using huge pages and just 0 and 1 for
> >> 4k pages. But my understanding of numa_page is that it's the "id" of the numa
> >> pages, which isn't that?
> >>
> >> With "adjust page_num" I get a number that starts at -1 and then increments
> >> from there. More correct, but doesn't quite seem right either.
> >
> > Apparently handling this special case of splitted buffers edge-cases
> > was Pandora box ;) Tomas what do we do about it? Does that has chance
> > to get in before freeze?
> >
>
> I don't think the split buffers are pandora box on their own, it's more
> that it made us notice other issues / questions. I don't think handling
> it is particularly complex - the most difficult part seems to be
> figuring out how many rows we'll return, and mapping them to pages.
> But that's not very difficult, IMO.

OK, with a fresh week, and fresh mind and a different name (ospageid)
it looks better to me now.

> The bigger question is whether it's safe to do the TYPEALIGN_DOWN(),
> which may return a pointer from before the first buffer. But I guess
> that's OK, thanks to how memory is allocated - at least, that's what all
> the move_pages() examples I found do, so unless those are all broken,
> that's OK.

I agree. This took some more time, but in case of

a) pg_buffercache_numa and HP=off view we shouldn't access ptr below
buffercache, because my understanding is that shm memory would be page
aligned anyway as per BufferManagerShmemInit() which uses
TYPEALGIN(PG_IO_ALIGN_SIZE) for it anyway. So when you query
pg_buffercache_numa, one gets the following pages:

# strace -fp 14364 -e move_pages
strace: Process 14364 attached
move_pages(0, 32768, [0x7f8bfe4b5000, 0x7f8bfe4b6000, 0x7f8bfe4b7000,...

(gdb) print BufferBlocks
$1 = 0x7f8bfe4b5000 ""

while BufferBlocks actually starts there, so we are good without HP.
With HP it's move_pages(0, 16384, [0x7ff33c800000, ... vs (gdb) print
BufferBlocks
=> $1 = 0x7ff33c879000
so we are actually accessing earlier pointer (!), but Buffer Blocks is
like 15-th structure there (and there's always going be something
earlier to my understanding):

select row_number() over (order by off),* from pg_shmem_allocations
order by off asc limit 15;
row_number | name | off | size | allocated_size
------------+------------------------+---------+-----------+----------------
[..]
13 | Shared MultiXact State | 5735936 | 1148 | 1152
14 | Buffer Descriptors | 5737088 | 1048576 | 1048576
15 | Buffer Blocks | 6785664 | 134221824 | 134221824

To me this is finding in itself, with HP shared_buffers being located
on page with something else...

b) in pg_shmem_allocations_numa view without HP we don't even get to
as low as ShmemBase for move_pages(), but with HP=on we hit as low as
ShmemBase but not lower, so we are good IMHO.

Wouldn't buildfarm actually tell us this if that's bad as an insurance policy?

> >>> + <tbody>
> >>> + <row>
> >>> + <entry role="catalog_table_entry"><para role="column_definition">
> >>> + <structfield>bufferid</structfield> <type>integer</type>
> >>> + </para>
> >>> + <para>
> >>> + ID, in the range 1..<varname>shared_buffers</varname>
> >>> + </para></entry>
> >>> + </row>
> >>> +
> >>> + <row>
> >>> + <entry role="catalog_table_entry"><para role="column_definition">
> >>> + <structfield>page_num</structfield> <type>int</type>
> >>> + </para>
> >>> + <para>
> >>> + number of OS memory page for this buffer
> >>> + </para></entry>
> >>> + </row>
> >>
> >> "page_num" doesn't really seem very descriptive for "number of OS memory page
> >> for this buffer". To me "number of" sounds like it's counting the number of
> >> associated pages, but it's really just a "page id" or something like that.
> >>
> >> Maybe rename page_num to "os_page_id" and rephrase the comment a bit?
> >
> > Tomas, are you good with rename? I think I've would also prefer os_page_id.
> >
>
> Fine with me.

Done, s/page_num/ospageid/g as whole pg_buffercache does not use "_"
anywhere, so let's stick to that. Done that for nodeid too.

> >>> + /*
> >>> + * Different database block sizes (4kB, 8kB, ..., 32kB) can be used, while
> >>> + * the OS may have different memory page sizes.
> >>> + *
> >>> + * To correctly map between them, we need to: 1. Determine the OS memory
> >>> + * page size 2. Calculate how many OS pages are used by all buffer blocks
> >>> + * 3. Calculate how many OS pages are contained within each database
> >>> + * block.
> >>> + *
> >>> + * This information is needed before calling move_pages() for NUMA memory
> >>> + * node inquiry.
> >>> + */
> >>> + os_page_size = pg_numa_get_pagesize();
> >>> +
> >>> + /*
> >>> + * Allocate memory for page pointers and status based on total shared
> >>> + * memory size. This simplified approach allocates enough space for all
> >>> + * pages in shared memory rather than calculating the exact requirements
> >>> + * for each segment.
> >>> + *
> >>> + * XXX Isn't this wasteful? But there probably is one large segment of
> >>> + * shared memory, much larger than the rest anyway.
> >>> + */
> >>> + shm_total_page_count = ShmemSegHdr->totalsize / os_page_size;
> >>> + page_ptrs = palloc0(sizeof(void *) * shm_total_page_count);
> >>> + pages_status = palloc(sizeof(int) * shm_total_page_count);
> >>
> >> There's a fair bit of duplicated code here with pg_buffercache_numa_pages(),
> >> could more be moved to a shared helper function?
> >
> > -> Tomas?
> >
>
> I'm not against that in principle, but when I tried it didn't quite help
> that much. But maybe it's better with the current patch version.
>
> >>> + hash_seq_init(&hstat, ShmemIndex);
> >>> +
> >>> + /* output all allocated entries */
> >>> + memset(nulls, 0, sizeof(nulls));
> >>> + while ((ent = (ShmemIndexEnt *) hash_seq_search(&hstat)) != NULL)
> >>> + {
> >>
> >> One thing that's interesting with using ShmemIndex is that this won't get
> >> anonymous allocations. I think that's ok for now, it'd be complicated to
> >> figure out the unmapped "regions". But I guess it' be good to mention it in
> >> the ocs?
> >
> > Added.
> >

BTW: it's also in the function comment too. now.

> > [..], but I'm sending it as fast as possible
> > to avoid a clash with Tomas.
> >
>
> Please keep working on this. I may hava a bit of time in the evening,
> but in the worst case I'll merge it into your patch.

So, attached is still patchset v25, still with one-off patches (please
git am most, but just `patch -p1 < file` last two and just squash it
if you are happy. I've intentionally not squash it to provide
changelog). This LGTM.

-J.

Attachment	Content-Type	Size
v25-0005-adjust-page-alignment.patch	application/octet-stream	1.9 KB
v25-0004-Introduce-pg_shmem_allocations_numa-view.patch	application/octet-stream	17.0 KB
v25-0001-Add-support-for-basic-NUMA-awareness.patch	application/octet-stream	21.9 KB
v25-0002-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch	application/octet-stream	21.0 KB
v25-0003-adjust-page_num.patch	application/octet-stream	1.8 KB
v25-0006-fixes-for-review-by-Andres.patch	application/octet-stream	5.3 KB
v25-0007-fix-remaining-outstanding-issues-from-Sunday.patch	application/octet-stream	3.7 KB

In response to

Re: Draft for basic NUMA observability at 2025-04-06 13:52:20 from Tomas Vondra

Responses

Re: Draft for basic NUMA observability at 2025-04-07 09:53:23 from Bertrand Drouvot

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Junwang Zhao	2025-04-07 08:13:35	Re: SQL Property Graph Queries (SQL/PGQ)
Previous Message	Richard Guo	2025-04-07 07:50:20	An incorrect check in get_memoize_path