Quick Links

Re: Draft for basic NUMA observability

From:	Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To:	Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Cc:	Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Draft for basic NUMA observability
Date:	2025-03-31 09:27:50
Message-ID:	CAKZiRmzKEkoYUB1_PR0uijPeqihyXCwBRY7N+WN4JGPz9Nk0Yw@mail.gmail.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Thu, Mar 27, 2025 at 2:40 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> Hi,

Hi Andres,

> On 2025-03-27 14:02:03 +0100, Jakub Wartak wrote:
> > setup_additional_packages_script: |
> > - #apt-get update
> > - #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
> > + apt-get update
> > + DEBIAN_FRONTEND=noninteractive apt-get -y install \
> > + libnuma1 \
> > + libnuma-dev
>
> I think libnuma is installed on the relevant platforms, so you shouldn't need
> to install it manually.

Fixed. Right, you mentioned this earlier, I just didnt know when it went online.

> > +#
> > +# libnuma
> > +#
> > +AC_MSG_CHECKING([whether to build with libnuma support])
> > +PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],
>
> Most other dependencies say "build with libxyz ..."

Done.

> > + * pg_numa.h
[..]
> > +#include "c.h"
> > +#include "postgres.h"
>
> Headers should never include either of those headers. Nor should .c files
> include both.

Fixed, huh, I've found explanation:
https://www.postgresql.org/message-id/11634.1488932128@sss.pgh.pa.us

> > @@ -200,6 +200,8 @@ pgxs_empty = [
> >
> > 'ICU_LIBS',
> >
> > + 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
> > +
> > 'LIBURING_CFLAGS', 'LIBURING_LIBS',
> > ]
>
> Maybe I am missing something, but are you actually defining and using those
> LIBNUMA_* vars anywhere?

OK, so it seems I've been missing `PKG_CHECK_MODULES(LIBNUMA, numa)`
in configure.ac that would set those *FLAGS. I'm little bit loss
dependent in how to gurantee that meson is synced with autoconf as per
project requirements - trying to use past commits as reference, but I
still could get something wrong here (especially in
src/Makefile.global.in)

> > +Size
> > +pg_numa_get_pagesize(void)
[..]
>
> Should this have a comment or an assertion that it can only be used after
> shared memory startup? Because before that huge_pages_status won't be
> meaningful?

Added both.

> > +#ifndef FRONTEND
> > +/*
> > + * XXX: not really tested as there is no way to trigger this in our
> > + * current usage of libnuma.
> > + *
> > + * The libnuma built-in code can be seen here:
> > + * https://github.com/numactl/numactl/blob/master/libnuma.c
> > + *
> > + */
> > +void
> > +numa_warn(int num, char *fmt,...)
[..]
>
> I think you would at least have to hold interrupts across this, as
> ereport(WARNING) does CHECK_FOR_INTERRUPTS() and it would not be safe to jump
> out of libnuma in case an interrupt has arrived.

On request by Alvaro I've removed it as that code is simply
unreachable/untestable, but point taken - I'm planning to re-add this
with holding interrupting in future when we start using proper
numa_interleave() one day. Anyway, please let me know if you want
still to keep it as deadcode. BTW for context , why this is deadcode
is explained in the latter part of [1] message (TL;DR; unless we use
pining/numa_interleave/local_alloc() we probably never reach that
warnings/error handlers).

"Another question without an easy answer as I never hit this error from
numa_move_pages(), one gets invalid stuff in *os_pages_status instead.
BUT!: most of our patch just uses things that cannot fail as per
libnuma usage. One way to trigger libnuma warnings is e.g. `chmod 700
/sys` (because it's hard to unmount it) and then still most of numactl
stuff works as euid != 0, but numactl --hardware gets at least
"libnuma: Warning: Cannot parse distance information in sysfs:
Permission denied" or same story with numactl -C 678 date. So unless
we start way more heavy use of libnuma (not just for observability)
there's like no point in that right now (?) Contrary to that: we can
do just do variadic elog() for that, I've put some code, but no idea
if that works fine..."

> > +Size
> > +pg_numa_get_pagesize(void)
[..]
>
> I would probably implement this once, outside of the big ifdef, with one more
> ifdef inside, given that you're sharing the same implementation.

Done.

> > From fde52bfc05470076753dcb3e38a846ef3f6defe9 Mon Sep 17 00:00:00 2001
> > From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
> > Date: Wed, 19 Mar 2025 09:34:56 +0100
> > Subject: [PATCH v16 2/4] This extracts code from contrib/pg_buffercache's
> > primary function to separate functions.
> >
> > This commit adds pg_buffercache_init_entries(), pg_buffercache_build_tuple()
> > and get_buffercache_tuple() that help fill result tuplestores based on the
> > buffercache contents. This will be used in a follow-up commit that implements
> > NUMA observability in pg_buffercache.
> >
> > Author: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
> > Reviewed-by: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
> > Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
> > ---
> > contrib/pg_buffercache/pg_buffercache_pages.c | 317 ++++++++++--------
> > 1 file changed, 178 insertions(+), 139 deletions(-)
> >
> > diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
> > index 62602af1775..86e0b8afe01 100644
> > --- a/contrib/pg_buffercache/pg_buffercache_pages.c
> > +++ b/contrib/pg_buffercache/pg_buffercache_pages.c
> > @@ -14,7 +14,6 @@
> > #include "storage/buf_internals.h"
> > #include "storage/bufmgr.h"
> >
> > -
> > #define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8
>
> Independent change.

Fixed.

> > #define NUM_BUFFERCACHE_PAGES_ELEM 9
> > #define NUM_BUFFERCACHE_SUMMARY_ELEM 5
> > @@ -68,80 +67,192 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
> > PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
> > PG_FUNCTION_INFO_V1(pg_buffercache_evict);
> >
> > -Datum
> > -pg_buffercache_pages(PG_FUNCTION_ARGS)
> > +/*
> > + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
> > + *
> > + * This is almost identical to pg_buffercache_numa_pages(), but this one performs
> > + * memory mapping inquiries to display NUMA node information for each buffer.
> > + */
>
> If it's a helper routine it's probably not identical to
> pg_buffercache_numa_pages()?

Of course not, fixed by removing that comment.

> > +/*
> > + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
> > + *
> > + * Build buffer cache information for a single buffer.
> > + */
> > +static void
> > +pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)
> > +{
>
> This isn't really building a tuple tuple? Seems somewhat confusing, because
> get_buffercache_tuple() does actually build one.

s/pg_buffercache_build_tuple/pg_buffercache_save_tuple/g , unless
someone wants to come with better name.

> > + BufferDesc *bufHdr;
> > + uint32 buf_state;
> > +
> > + bufHdr = GetBufferDescriptor(record_id);
> > + /* Lock each buffer header before inspecting. */
> > + buf_state = LockBufHdr(bufHdr);
> > +
> > + fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr);
> > + fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
> > + fctx->record[record_id].reltablespace = bufHdr->tag.spcOid;
> > + fctx->record[record_id].reldatabase = bufHdr->tag.dbOid;
> > + fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag);
> > + fctx->record[record_id].blocknum = bufHdr->tag.blockNum;
> > + fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
> > + fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);
>
> As above, I think this would be more readable if you put
> fctx->record[record_id] into a local var.

Done.

> > +static Datum
> > +get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
> > +{
> > + Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
> > + bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
> > + HeapTuple tuple;
> > +
> > + values[0] = Int32GetDatum(fctx->record[record_id].bufferid);
> > + nulls[0] = false;
> > +
> > + /*
> > + * Set all fields except the bufferid to null if the buffer is unused or
> > + * not valid.
> > + */
> > + if (fctx->record[record_id].blocknum == InvalidBlockNumber ||
> > + fctx->record[record_id].isvalid == false)
> > + {
> > + nulls[1] = true;
> > + nulls[2] = true;
> > + nulls[3] = true;
> > + nulls[4] = true;
> > + nulls[5] = true;
> > + nulls[6] = true;
> > + nulls[7] = true;
> > +
> > + /* unused for v1.0 callers, but the array is always long enough */
> > + nulls[8] = true;
>
> I'd probably just memset the entire nulls array to either true or false,
> instead of doing it one-by-one.

Done.

> > + }
> > + else
> > + {
> > + values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber);
> > + nulls[1] = false;
> > + values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace);
> > + nulls[2] = false;
> > + values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase);
> > + nulls[3] = false;
> > + values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum);
> > + nulls[4] = false;
> > + values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum);
> > + nulls[5] = false;
> > + values[6] = BoolGetDatum(fctx->record[record_id].isdirty);
> > + nulls[6] = false;
> > + values[7] = Int16GetDatum(fctx->record[record_id].usagecount);
> > + nulls[7] = false;
>
> Seems like it would end up a lot more readable if you put
> fctx->record[record_id] into a local variable. Unfortunately that'd probably
> be best done in one more commit ahead of the rest of the this one...

Done, i've put it those refactorig changes into the commit already
dedicated only for a refactor. For the record Bertrand also asked for
something about this, but I was somehow afraid to touch Tom's code.

> > @@ -0,0 +1,28 @@
> > +SELECT NOT(pg_numa_available()) AS skip_test \gset
> > +\if :skip_test
> > +\quit
> > +\endif
>
> You could avoid the need for an alternative output file if you instead made
> the queries do something like
> SELECT NOT pg_numa_available() OR count(*) ...

ITEM REMAINING: Is this for the future or can it stay like that? I
don't have a hard opinion on this, but I've already wasted lots of
cycles to discover that one can have those ".1" alternative expected
result files.

> > --- /dev/null
> > +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> > @@ -0,0 +1,42 @@
> > +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
> > +
> > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> > +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> > +
> > +-- Register the new function.
> > +DROP VIEW pg_buffercache;
> > +DROP FUNCTION pg_buffercache_pages();
>
> I don't think we can just drop a view in the upgrade script. That will fail if
> anybody created a view depending on pg_buffercache.

Ugh, fixed, thanks. That must have been some leftover (we later do
CREATE OR REPLACE those anyway).

> (Sorry, ran out of time / energy here, i had originally just wanted to comment
> on the apt-get thing in the tests)

Thanks! AIO intensifies ... :)

-J.

[1] - https://www.postgresql.org/message-id/CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA%2Bs%2B_5hcTOg%40mail.gmail.com

Attachment	Content-Type	Size
v17-0003-Extend-pg_buffercache-with-new-view-pg_buffercac.patch	application/octet-stream	17.2 KB
v17-0001-Add-optional-dependency-to-libnuma-Linux-only-fo.patch	application/octet-stream	21.4 KB
v17-0004-Add-pg_shmem_numa_allocations-to-show-NUMA-memor.patch	application/octet-stream	16.2 KB
v17-0002-This-extracts-code-from-contrib-pg_buffercache-s.patch	application/octet-stream	12.8 KB

In response to

Re: Draft for basic NUMA observability at 2025-03-27 13:40:29 from Andres Freund

Responses

Re: Draft for basic NUMA observability at 2025-03-31 14:59:13 from Bertrand Drouvot

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Mahendra Singh Thalor	2025-03-31 09:34:38	Re: Non-text mode for pg_dumpall
Previous Message	Jakub Wartak	2025-03-31 09:27:42	Re: Draft for basic NUMA observability