Re: Draft for basic NUMA observability

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: Draft for basic NUMA observability
Date: 2025-02-17 12:02:04
Message-ID: CAKZiRmzgaN-vZeoDjSHCbavU7dDyBLa1Vyp4sW=WQaZ4R43mvw@mail.gmail.com
Lists: pgsql-hackers

On Thu, Feb 13, 2025 at 4:28 PM Bertrand Drouvot
<bertranddrouvot(dot)pg(at)gmail(dot)com> wrote:

Hi Bertrand,

Thanks for playing with this!

> Which makes me wonder if using numa_move_pages()/move_pages is the right approach. Would be curious to know if you observe the same behavior though.

You are correct: I'm observing identical behaviour, please see the attached file.

> Forcing the allocation to happen inside a monitoring function is decidedly not great.

We would probably need to split it out into a separate, new view
within the pg_buffercache extension; that is going to be slow, yet
it would still provide valid results. In the previous approach,
get_mempolicy() was allocating on first access, but it was slow not only
because it was allocating but also because it needed one syscall per
address (yikes!). I struggle to imagine how e.g. scanning
(really allocating) a 128GB buffer cache in the future wouldn't cause
issues: that's about 16-17 million (* 2) syscalls to be issued when not
using move_pages(2).
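
The arithmetic behind that estimate can be sketched as below. This is only an illustration: the helper names are hypothetical, and it assumes 8 kB buffers and that the "(* 2)" refers to two 4 kB OS pages per buffer. The batch size of 1024 addresses per call is likewise an arbitrary assumption, used only to show how batching changes the call count.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of OS pages needed to back pool_bytes,
 * rounding up to a whole page.
 */
static uint64_t
os_pages_in_pool(uint64_t pool_bytes, uint64_t os_page_size)
{
    assert(os_page_size > 0);
    return (pool_bytes + os_page_size - 1) / os_page_size;
}

/*
 * Hypothetical helper: number of calls needed when each call can
 * handle at most `batch` addresses (batch == 1 models one syscall
 * per address).
 */
static uint64_t
calls_needed(uint64_t n_addrs, uint64_t batch)
{
    assert(batch > 0);
    return (n_addrs + batch - 1) / batch;
}
```

With a 128 GiB pool, os_pages_in_pool() gives 16,777,216 for 8 kB units and 33,554,432 for 4 kB OS pages. One call per address therefore means about 33.5 million calls, while a batch of 1024 addresses per call would need 32,768.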

Another thing is that numa_maps(5) won't help us much either (not enough
granularity).

> But maybe we could use get_mempolicy() only on "valid" buffers i.e ((buf_state & BM_VALID) && (buf_state & BM_TAG_VALID)), thoughts?

Different perspective: I wanted to use the same approach in the new
pg_shmemallocations_numa, but that won't cut it there. The other idea
that came to my mind is to issue move_pages() from a backend that
has already touched all of those pages. That literally means one of the
ideas below:
1. from somewhere like checkpointer / bgwriter?
2. add touching memory on backend startup, like, always (sic!)
3. or just attempt to read/touch the memory address just before calling
move_pages(). E.g. this last option is just two lines:

     if(os_page_ptrs[blk2page+j] == 0) {
+        volatile uint64 touch pg_attribute_unused();
         os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) +
             (os_page_size*j);
+        touch = *(uint64 *)os_page_ptrs[blk2page+j];
     }

and it seems to work while still issuing far fewer syscalls with
move_pages() across backends, well, at least here.
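
For illustration, the "touch, then query" idea can be sketched outside PostgreSQL roughly as follows. This is a minimal sketch, not the patch itself: collect_and_touch_pages() and its parameters are hypothetical, base is assumed to be aligned to os_page_size, and whether a plain read is enough to get a page backed depends on how the memory was mapped (the message above only reports that it seems to work for shared buffers).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: record the address of every OS page in
 * [base, base + len) into page_ptrs[] (at most max_pages entries)
 * and read one byte from each page, so that the calling process has
 * accessed the page before its location is queried.
 *
 * Returns the number of addresses stored.
 */
static size_t
collect_and_touch_pages(char *base, size_t len, size_t os_page_size,
                        void **page_ptrs, size_t max_pages)
{
    size_t n = 0;

    assert(os_page_size > 0);

    for (size_t off = 0; off < len && n < max_pages; off += os_page_size)
    {
        volatile unsigned char touch;

        page_ptrs[n] = base + off;

        /* The volatile read keeps the compiler from dropping the access. */
        touch = *(volatile unsigned char *) page_ptrs[n];
        (void) touch;

        n++;
    }

    return n;
}
```

The resulting array is what would then be handed to a single move_pages(2) call with a NULL nodes argument, which reports the node of each page in the status array instead of moving anything.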

Frankly speaking, I do not know which path to take with this; maybe
that's good enough?

-J.

Attachment: numa_test.txt (text/plain, 1.5 KB)
