Re: Draft for basic NUMA observability

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Draft for basic NUMA observability
Date: 2025-03-13 14:15:14
Message-ID: Z9LockmbFInuSt0X@ip-10-97-1-34.eu-west-3.compute.internal
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On Wed, Mar 12, 2025 at 04:41:15PM +0100, Jakub Wartak wrote:
> On Mon, Mar 10, 2025 at 11:14 AM Bertrand Drouvot
> <bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:
>
> > Thanks for the new version!
>
> v10 is attached with most fixes after review and one new thing
> introduced: pg_numa_available() for run-time decision inside tests
> which was needed after simplifying code a little bit as you wanted.
> I've also fixed -Dlibnuma=disabled as it was not properly implemented.
> There are couple open items (minor/decision things), but most is fixed
> or answered:

Thanks for the new version!

> > === 2
> >
> > +else
> > + as_fn_error $? "header file <numa.h> is required for --with-libnuma" "$LINENO" 5
> > +fi
> >
> > Maybe we should add the same test (checking for numa.h) for meson?
>
> TBH, I have no idea, libnuma.h may exist but it may not link e.g. when
> cross-compiling 32-bit on 64-bit. Or is this more about keeping sync
> between meson and autoconf?

Yeah, idea was to have both in sync.

> > === 8
> >
> > # create extension pg_buffercache;
> > ERROR: could not load library "/home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so": /home/postgres/postgresql/pg_installed/pg18/lib/pg_buffercache.so: undefined symbol: pg_numa_query_pages
> > CONTEXT: SQL statement "CREATE FUNCTION pg_buffercache_pages()
> > RETURNS SETOF RECORD
> > AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
> > LANGUAGE C PARALLEL SAFE"
> > extension script file "pg_buffercache--1.2.sql", near line 7
> >
> > While that's ok if 0003 is applied. I think that each individual patch should
> > "fully" work individually.
>
> STILL OPEN QUESTION: Not sure I understand: you need to have 0001 +
> 0002 or 0001 + 0003, but here 0002 is complaining about lack of
> pg_numa_query_pages() which is part of 0001 (?) Should I merge those
> patches or keep them separate?

Applying 0001 + 0002 only does produce the issue above. I think that each
sub-patch once applied should pass make check-world, that means:

0001 : should pass
0001 + 0002 : should pass
0001 + 0002 + 0003 : should pass

> > Also I need to think more about how firstUseInBackend is used, for example:
> >
> > ==== 17.1
> >
> > would it be better to define it as a file scope variable? (at least that
> > would avoid to pass it as an extra parameter in some functions).
>
> I have no strong opinion on this, but I have one doubt (for future):
> isn't creating global variables making life harder for upcoming
> multithreading guys ?

That would be "just" one more. I think it's better to use it to avoid the "current"
code using "useless" function parameters.

> > === 17.2
> >
> > what happens if firstUseInBackend is set to false and later on the pages
> > are moved to different NUMA nodes. Then pg_buffercache_numa_pages() is
> > called again by a backend that already had set firstUseInBackend to false,
> > would that provide accurate information?
>
> It is still the correct result. That "touching" (paging-in) is only
> necessary probably to properly resolve PTEs as the fork() does not
> seem to carry them over from parent:
>
> So the above clearly shows that initial touching of shm is required,
> but just once and it stays valid afterwards.

Great, thanks for the demo and the explanation!

> > === 19
> >
>
> Can you please take a look again on this

Sure, will do.

> > === 23
> >
> > + if (s >= 0)
> > + zones[s]++;
> >
> > should we also check that s is below a limit?
>
> STILL OPEN QUESTION: I'm not sure it would give us value to report
> e.g. -2 on per shm entry/per numa node entry, would it? If it would we
> could somehow overallocate that array and remember negative ones too.

I meant to say, ensure that it is below the max node number.

> > === 24
> >
> > Regarding how we make use of pg_numa_get_max_node(), are we sure there is
> > no possible holes? I mean could a system have node 0,1 and 3 but not 2?
>
> I have never seen a hole in numbering as it is established during
> boot, and the only way that could get close to adjusting it could be
> making processor books (CPU and RAM together) offline.

Yeah probably.

> Even with all of that, that still
> shouldn't cause issues for this code I think, because
> `numa_max_node()` says it `returns the >> highest node number <<<
> available on the current system.`

Yeah, I agree, thanks for clearing my doubts on it.

> > Also I don't think I'm a Co-author, I think I'm just a reviewer (even if I
> > did a little in 0001 though)
>
> That was an attempt to say "Thank You",

Thanks a lot! OTOH, I don't want to get credit for something that I did not do ;-)

I'll have a look at v11 soon.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message vignesh C 2025-03-13 14:34:11 Re: plperl version on the meson setup summary screen
Previous Message Tomas Vondra 2025-03-13 14:15:01 Re: Implement waiting for wal lsn replay: reloaded