Quick Links

Re: Draft for basic NUMA observability

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc:	Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Draft for basic NUMA observability
Date:	2025-03-27 13:40:29
Message-ID:	tvdbhi7rkk7rvpcm22bj6jnri4ucdb4dnzefer6q7hqflpkxm7@eq37undomnxw
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On 2025-03-27 14:02:03 +0100, Jakub Wartak wrote:
> setup_additional_packages_script: |
> - #apt-get update
> - #DEBIAN_FRONTEND=noninteractive apt-get -y install ...
> + apt-get update
> + DEBIAN_FRONTEND=noninteractive apt-get -y install \
> + libnuma1 \
> + libnuma-dev

I think libnuma is installed on the relevant platforms, so you shouldn't need
to install it manually.

> +#
> +# libnuma
> +#
> +AC_MSG_CHECKING([whether to build with libnuma support])
> +PGAC_ARG_BOOL(with, libnuma, no, [use libnuma for NUMA awareness],

Most other dependencies say "build with libxyz ..."

> +/*-------------------------------------------------------------------------
> + *
> + * pg_numa.h
> + * Basic NUMA portability routines
> + *
> + *
> + * Copyright (c) 2025, PostgreSQL Global Development Group
> + *
> + * IDENTIFICATION
> + * src/include/port/pg_numa.h
> + *
> + *-------------------------------------------------------------------------
> + */
> +#ifndef PG_NUMA_H
> +#define PG_NUMA_H
> +
> +#include "c.h"
> +#include "postgres.h"

Headers should never include either of those headers. Nor should .c files
include both.

> @@ -200,6 +200,8 @@ pgxs_empty = [
>
> 'ICU_LIBS',
>
> + 'LIBNUMA_CFLAGS', 'LIBNUMA_LIBS',
> +
> 'LIBURING_CFLAGS', 'LIBURING_LIBS',
> ]

Maybe I am missing something, but are you actually defining and using those
LIBNUMA_* vars anywhere?

> +Size
> +pg_numa_get_pagesize(void)
> +{
> + Size os_page_size = sysconf(_SC_PAGESIZE);
> +
> + if (huge_pages_status == HUGE_PAGES_ON)
> + GetHugePageSize(&os_page_size, NULL);
> +
> + return os_page_size;
> +}

Should this have a comment or an assertion that it can only be used after
shared memory startup? Because before that huge_pages_status won't be
meaningful?

> +#ifndef FRONTEND
> +/*
> + * XXX: not really tested as there is no way to trigger this in our
> + * current usage of libnuma.
> + *
> + * The libnuma built-in code can be seen here:
> + * https://github.com/numactl/numactl/blob/master/libnuma.c
> + *
> + */
> +void
> +numa_warn(int num, char *fmt,...)
> +{
> + va_list ap;
> + int olde = errno;
> + int needed;
> + StringInfoData msg;
> +
> + initStringInfo(&msg);
> +
> + va_start(ap, fmt);
> + needed = appendStringInfoVA(&msg, fmt, ap);
> + va_end(ap);
> + if (needed > 0)
> + {
> + enlargeStringInfo(&msg, needed);
> + va_start(ap, fmt);
> + appendStringInfoVA(&msg, fmt, ap);
> + va_end(ap);
> + }
> +
> + ereport(WARNING,
> + (errcode(ERRCODE_EXTERNAL_ROUTINE_EXCEPTION),
> + errmsg_internal("libnuma: WARNING: %s", msg.data)));

I think you would at least have to hold interrupts across this, as
ereport(WARNING) does CHECK_FOR_INTERRUPTS() and it would not be safe to jump
out of libnuma in case an interrupt has arrived.

> +Size
> +pg_numa_get_pagesize(void)
> +{
> +#ifndef WIN32
> + Size os_page_size = sysconf(_SC_PAGESIZE);
> +#else
> + Size os_page_size;
> + SYSTEM_INFO sysinfo;
> +
> + GetSystemInfo(&sysinfo);
> + os_page_size = sysinfo.dwPageSize;
> +#endif
> + if (huge_pages_status == HUGE_PAGES_ON)
> + GetHugePageSize(&os_page_size, NULL);
> + return os_page_size;
> +}

I would probably implement this once, outside of the big ifdef, with one more
ifdef inside, given that you're sharing the same implementation.

> From fde52bfc05470076753dcb3e38a846ef3f6defe9 Mon Sep 17 00:00:00 2001
> From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
> Date: Wed, 19 Mar 2025 09:34:56 +0100
> Subject: [PATCH v16 2/4] This extracts code from contrib/pg_buffercache's
> primary function to separate functions.
>
> This commit adds pg_buffercache_init_entries(), pg_buffercache_build_tuple()
> and get_buffercache_tuple() that help fill result tuplestores based on the
> buffercache contents. This will be used in a follow-up commit that implements
> NUMA observability in pg_buffercache.
>
> Author: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
> Reviewed-by: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
> Discussion: https://postgr.es/m/CAKZiRmxh6KWo0aqRqvmcoaX2jUxZYb4kGp3N%3Dq1w%2BDiH-696Xw%40mail.gmail.com
> ---
> contrib/pg_buffercache/pg_buffercache_pages.c | 317 ++++++++++--------
> 1 file changed, 178 insertions(+), 139 deletions(-)
>
> diff --git a/contrib/pg_buffercache/pg_buffercache_pages.c b/contrib/pg_buffercache/pg_buffercache_pages.c
> index 62602af1775..86e0b8afe01 100644
> --- a/contrib/pg_buffercache/pg_buffercache_pages.c
> +++ b/contrib/pg_buffercache/pg_buffercache_pages.c
> @@ -14,7 +14,6 @@
> #include "storage/buf_internals.h"
> #include "storage/bufmgr.h"
>
> -
> #define NUM_BUFFERCACHE_PAGES_MIN_ELEM 8

Independent change.

> #define NUM_BUFFERCACHE_PAGES_ELEM 9
> #define NUM_BUFFERCACHE_SUMMARY_ELEM 5
> @@ -68,80 +67,192 @@ PG_FUNCTION_INFO_V1(pg_buffercache_summary);
> PG_FUNCTION_INFO_V1(pg_buffercache_usage_counts);
> PG_FUNCTION_INFO_V1(pg_buffercache_evict);
>
> -Datum
> -pg_buffercache_pages(PG_FUNCTION_ARGS)
> +/*
> + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
> + *
> + * This is almost identical to pg_buffercache_numa_pages(), but this one performs
> + * memory mapping inquiries to display NUMA node information for each buffer.
> + */

If it's a helper routine it's probably not identical to
pg_buffercache_numa_pages()?

> +/*
> + * Helper routine for pg_buffercache_pages() and pg_buffercache_numa_pages().
> + *
> + * Build buffer cache information for a single buffer.
> + */
> +static void
> +pg_buffercache_build_tuple(int record_id, BufferCachePagesContext *fctx)
> +{

This isn't really building a tuple tuple? Seems somewhat confusing, because
get_buffercache_tuple() does actually build one.

> + BufferDesc *bufHdr;
> + uint32 buf_state;
> +
> + bufHdr = GetBufferDescriptor(record_id);
> + /* Lock each buffer header before inspecting. */
> + buf_state = LockBufHdr(bufHdr);
> +
> + fctx->record[record_id].bufferid = BufferDescriptorGetBuffer(bufHdr);
> + fctx->record[record_id].relfilenumber = BufTagGetRelNumber(&bufHdr->tag);
> + fctx->record[record_id].reltablespace = bufHdr->tag.spcOid;
> + fctx->record[record_id].reldatabase = bufHdr->tag.dbOid;
> + fctx->record[record_id].forknum = BufTagGetForkNum(&bufHdr->tag);
> + fctx->record[record_id].blocknum = bufHdr->tag.blockNum;
> + fctx->record[record_id].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
> + fctx->record[record_id].pinning_backends = BUF_STATE_GET_REFCOUNT(buf_state);

As above, I think this would be more readable if you put
fctx->record[record_id] into a local var.

> +static Datum
> +get_buffercache_tuple(int record_id, BufferCachePagesContext *fctx)
> +{
> + Datum values[NUM_BUFFERCACHE_PAGES_ELEM];
> + bool nulls[NUM_BUFFERCACHE_PAGES_ELEM];
> + HeapTuple tuple;
> +
> + values[0] = Int32GetDatum(fctx->record[record_id].bufferid);
> + nulls[0] = false;
> +
> + /*
> + * Set all fields except the bufferid to null if the buffer is unused or
> + * not valid.
> + */
> + if (fctx->record[record_id].blocknum == InvalidBlockNumber ||
> + fctx->record[record_id].isvalid == false)
> + {
> + nulls[1] = true;
> + nulls[2] = true;
> + nulls[3] = true;
> + nulls[4] = true;
> + nulls[5] = true;
> + nulls[6] = true;
> + nulls[7] = true;
> +
> + /* unused for v1.0 callers, but the array is always long enough */
> + nulls[8] = true;

I'd probably just memset the entire nulls array to either true or false,
instead of doing it one-by-one.

> + }
> + else
> + {
> + values[1] = ObjectIdGetDatum(fctx->record[record_id].relfilenumber);
> + nulls[1] = false;
> + values[2] = ObjectIdGetDatum(fctx->record[record_id].reltablespace);
> + nulls[2] = false;
> + values[3] = ObjectIdGetDatum(fctx->record[record_id].reldatabase);
> + nulls[3] = false;
> + values[4] = ObjectIdGetDatum(fctx->record[record_id].forknum);
> + nulls[4] = false;
> + values[5] = Int64GetDatum((int64) fctx->record[record_id].blocknum);
> + nulls[5] = false;
> + values[6] = BoolGetDatum(fctx->record[record_id].isdirty);
> + nulls[6] = false;
> + values[7] = Int16GetDatum(fctx->record[record_id].usagecount);
> + nulls[7] = false;

Seems like it would end up a lot more readable if you put
fctx->record[record_id] into a local variable. Unfortunately that'd probably
be best done in one more commit ahead of the rest of the this one...

> @@ -0,0 +1,28 @@
> +SELECT NOT(pg_numa_available()) AS skip_test \gset
> +\if :skip_test
> +\quit
> +\endif

You could avoid the need for an alternative output file if you instead made
the queries do something like
SELECT NOT pg_numa_available() OR count(*) ...

> --- /dev/null
> +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> @@ -0,0 +1,42 @@
> +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
> +
> +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> +
> +-- Register the new function.
> +DROP VIEW pg_buffercache;
> +DROP FUNCTION pg_buffercache_pages();

I don't think we can just drop a view in the upgrade script. That will fail if
anybody created a view depending on pg_buffercache.

(Sorry, ran out of time / energy here, i had originally just wanted to comment
on the apt-get thing in the tests)

Greetings,

Andres Freund

In response to

Re: Draft for basic NUMA observability at 2025-03-27 13:02:03 from Jakub Wartak

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andrei Lepikhov	2025-03-27 13:46:50	Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment
Previous Message	Tom Lane	2025-03-27 13:40:16	Re: Remove vardata parameters from eqjoinsel_inner