Re: Draft for basic NUMA observability

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Draft for basic NUMA observability
Date: 2025-02-13 15:28:47
Message-ID: Z64Pr8CTG0RTrGR3@ip-10-97-1-34.eu-west-3.compute.internal
Lists: pgsql-hackers

Hi,

On Fri, Feb 07, 2025 at 03:32:43PM +0100, Jakub Wartak wrote:
> As I have promised to Andres on the Discord hacking server some time
> ago, I'm attaching the very brief (and potentially way too rushed)
> draft of the first step into NUMA observability on PostgreSQL that was
> based on his presentation [0]. It might be rough, but it is to get us
> started. The patches were not really even basically tested, they are
> more like input for discussion - rather than solid code - to shake out
> what should be the proper form of this.
>
> Right now it gives:
>
> postgres=# select numa_zone_id, count(*) from pg_buffercache group by
> numa_zone_id;
> NOTICE: os_page_count=32768 os_page_size=4096 pages_per_blk=2.000000
>  numa_zone_id | count
> --------------+-------
>               | 16127
>             6 |   256
>             1 |     1

Thanks for the patch!

Not doing a code review but sharing some experimentation.

First, I had to make these changes:

@@ -99,7 +100,7 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
 	Size		os_page_size;
 	void	  **os_page_ptrs;
 	int		   *os_pages_status;
-	int			os_page_count;
+	uint64		os_page_count;

and

- os_page_count = (NBuffers * BLCKSZ) / os_page_size;
+ os_page_count = ((uint64)NBuffers * BLCKSZ) / os_page_size;

to make it work with non-tiny shared_buffers (otherwise NBuffers * BLCKSZ
overflows a 32-bit int).
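
For illustration, here is a minimal standalone sketch of the arithmetic behind that fix. The function names and parameters are hypothetical stand-ins (nbuffers for NBuffers, blcksz for BLCKSZ); it is not the patch code, only a demonstration of why the cast has to happen before the multiplication.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Number of OS pages backing the buffer pool.
 * nbuffers and blcksz stand in for NBuffers and BLCKSZ (both plain ints);
 * widening to 64 bits before multiplying keeps the product from overflowing.
 */
uint64_t
os_page_count(int nbuffers, int blcksz, size_t os_page_size)
{
	assert(os_page_size > 0);
	return ((uint64_t) nbuffers * (uint64_t) blcksz) / os_page_size;
}

/* True when nbuffers * blcksz (in bytes) does not fit in a 32-bit int. */
int
product_overflows_int(int nbuffers, int blcksz)
{
	return (int64_t) nbuffers * (int64_t) blcksz > INT_MAX;
}
```

With 5242880 buffers of 8192 bytes (the buffer count implied by the row totals in the outputs further down) and 4096-byte OS pages, os_page_count() gives 10485760, which matches the os_page_count printed in the NOTICE lines, while the byte product itself is far beyond INT_MAX.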

Observations:

When using 2 sessions:

Session 1 first loads buffers (e.g., by querying a relation) and then runs
'select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;'

Session 2 does nothing but runs 'select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;'

I see a lot of '-2' for the numa_zone_id in session 2, indicating that pages appear
as unmapped when viewed from a process that hasn't accessed them, even though
those same pages appear as allocated on a NUMA node in session 1.

To double check, I created a function pg_buffercache_pages_from_pid() that is
exactly the same as pg_buffercache_pages() (with your patch) except that it
takes a pid as input and uses it in move_pages(<pid>, …).
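
As a rough sketch of what such a per-pid query boils down to (this is not the code of pg_buffercache_pages_from_pid(); the helper names below are hypothetical), move_pages() can be called with nodes = NULL so that it only reports where each page is, as seen through the page tables of the given pid. The raw syscall is used here so the sketch does not need libnuma; on Linux a status of -ENOENT (-2) means "page not present".

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

enum page_kind { PAGE_ON_NODE, PAGE_NOT_PRESENT, PAGE_ERROR };

/*
 * Ask the kernel where `count` pages are located, looked up in the address
 * space of process `pid` (0 means the calling process). With nodes = NULL
 * nothing is migrated; status[i] receives a node id or a negative errno.
 * Returns 0 on success, -1 with errno set otherwise.
 */
int
query_page_nodes(pid_t pid, unsigned long count, void **pages, int *status)
{
	assert(pages != NULL && status != NULL);
#if defined(__linux__) && defined(SYS_move_pages)
	return (int) syscall(SYS_move_pages, pid, count, pages, NULL, status, 0);
#else
	errno = ENOSYS;				/* not available on this platform */
	return -1;
#endif
}

/* Classify one status[] entry returned by the query above. */
enum page_kind
page_status_kind(int status)
{
	if (status >= 0)
		return PAGE_ON_NODE;		/* value is the NUMA node id */
	if (status == -ENOENT)
		return PAGE_NOT_PRESENT;	/* shows up as -2 in the output */
	return PAGE_ERROR;
}
```

Querying another process this way assumes the caller is allowed to inspect it (for example, same OS user), and the call may be unavailable in restricted environments such as some containers.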

Let me show the results:

In session 1 (that "accessed/loaded" the ~65K buffers):

postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177310
            0 |   65192
           -2 |     378
(3 rows)

postgres=# select pg_backend_pid();
 pg_backend_pid
----------------
        1662580

In session 2:

postgres=# select numa_zone_id, count(*) from pg_buffercache group by numa_zone_id;
NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177301
            0 |      85
           -2 |   65494
(3 rows)

postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(pg_backend_pid()) group by numa_zone_id;
NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177301
            0 |      90
           -2 |   65489
(3 rows)

But when session 1's pid is used:

postgres=# select numa_zone_id, count(*) from pg_buffercache_pages_from_pid(1662580) group by numa_zone_id;
NOTICE: os_page_count=10485760 os_page_size=4096 pages_per_blk=2.000000
 numa_zone_id |  count
--------------+---------
              | 5177301
            0 |   65195
           -2 |     384
(3 rows)

Results show:

- Correct NUMA distribution in session 1
- Correct NUMA distribution in session 2 only when using pg_buffercache_pages_from_pid()
  with the pid of session 1 as a parameter (the session that actually accessed the buffers)

Which makes me wonder whether using numa_move_pages()/move_pages() is the
right approach. I would be curious to know if you observe the same behavior though.

The initial idea that you shared on Discord was to use get_mempolicy() but
as Andres stated:

"
One annoying thing about get_mempolicy() is this:

If no page has yet been allocated for the specified address, get_mempolicy() will allocate a page as if the thread
had performed a read (load) access to that address, and return the ID of the node where that page was allocated.

Forcing the allocation to happen inside a monitoring function is decidedly not great.
"

The man page looks correct (verified with "perf record -e page-faults,kmem:mm_page_alloc -p <pid>")
while using get_mempolicy().

But maybe we could use get_mempolicy() only on "valid" buffers, i.e.,
((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?
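
To make the idea concrete, here is a small untested sketch under stated assumptions: the SKETCH_BM_* bit values are illustrative stand-ins for the real flags in buf_internals.h, the helper names are hypothetical, and the MPOL_F_* values mirror the ones in the Linux header linux/mempolicy.h. The underlying assumption is that a valid buffer's memory was already allocated when it was loaded, so the lookup should not trigger a fresh allocation; that would still need to be verified.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Illustrative stand-ins for BM_VALID / BM_TAG_VALID (see buf_internals.h). */
#define SKETCH_BM_VALID		(1U << 24)
#define SKETCH_BM_TAG_VALID	(1U << 25)

/* Same values as MPOL_F_NODE / MPOL_F_ADDR in <linux/mempolicy.h>. */
#define SKETCH_MPOL_F_NODE	(1UL << 0)
#define SKETCH_MPOL_F_ADDR	(1UL << 1)

/* The "valid buffer" test proposed above. */
int
buffer_is_valid(uint32_t buf_state)
{
	return (buf_state & SKETCH_BM_VALID) && (buf_state & SKETCH_BM_TAG_VALID);
}

/*
 * Return the NUMA node of the page containing addr, or -1 if the buffer is
 * skipped (not valid) or the lookup fails. Invalid buffers are never passed
 * to get_mempolicy(), so their memory is not touched by this function.
 */
int
node_of_valid_buffer(uint32_t buf_state, void *addr)
{
	int			node = -1;

	if (!buffer_is_valid(buf_state))
		return -1;
#if defined(__linux__) && defined(SYS_get_mempolicy)
	/* With both flags set, the node id is returned through the first argument. */
	if (syscall(SYS_get_mempolicy, &node, NULL, 0UL, addr,
				SKETCH_MPOL_F_NODE | SKETCH_MPOL_F_ADDR) != 0)
		return -1;
#endif
	return node;
}
```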

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
