Re: Draft for basic NUMA observability

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Draft for basic NUMA observability
Date: 2025-02-26 08:38:20
Message-ID: CAKZiRmzpvBtqrz5Jr2DDcfk4Ar1ppsXkUhEX9RpA+s+_5hcTOg@mail.gmail.com
Lists: pgsql-hackers

On Mon, Feb 24, 2025 at 3:06 PM Andres Freund <andres(at)anarazel(dot)de> wrote:

> On 2025-02-24 12:57:16 +0100, Jakub Wartak wrote:

Hi Andres, thanks for your review!

OK, first sane version attached, with new src/port/pg_numa.c boilerplate
in 0001. Fixed some bugs too; there is one remaining optimization to
be done (see the `static` question later). Docs/tests are still
missing.

QQ: I'm still wondering whether there's a better way of exposing multiple
pg shmem entries pointing to the same OS page (think of something hot:
PROCLOCK or ProcArray), so wouldn't it make sense (in some future
thread/patch) to expose this info somehow via an additional column
(pg_get_shmem_numa_allocations.shared_pages bool?)? I'm thinking of
an easy way of showing that a potential NUMA auto-balancing could lead
to NUMA TLB shootdowns (not that it is happening or being counted, just
identifying it as a potential problem in the allocation layout). Or does
that stuff not make sense, as we already have
pg_shmem_allocations.{off|size} and we could derive such info from it
(after all it is for devs?)?

postgres(at)postgres:1234 : 18843 # select name, off, off+allocated_size, allocated_size
from pg_shmem_allocations order by off;
          name          |    off    | ?column?  | allocated_size
------------------------+-----------+-----------+----------------
[..]
 Proc Header            | 147114112 | 147114240 |            128
 Proc Array             | 147274752 | 147275392 |            640
 KnownAssignedXids      | 147275392 | 147310848 |          35456
 KnownAssignedXidsValid | 147310848 | 147319808 |           8960
 Backend Status Array   | 147319808 | 147381248 |          61440

postgres(at)postgres:1234 : 18924 # select * from pg_shmem_numa_allocations
where name IN ('Proc Header', 'Proc Array', 'KnownAssignedXids', '..') order by name;
       name        | numa_zone_id | numa_size
-------------------+--------------+-----------
 KnownAssignedXids |            0 |   2097152
 Proc Array        |            0 |   2097152
 Proc Header       |            0 |   2097152

I.e. Proc Array ends and KnownAssignedXids starts right afterwards; both
are hot, yet they land on the same huge page and hence the same NUMA node.

> > If there would be agreement that this is the way we want to have it
> > (from the backend and not from checkpointer), here's what's left on
> > the table to be done here:
>
> > a. isn't there something quicker for touching / page-faulting memory ?
>
> If you actually fault in a page the kernel actually has to allocate memory and
> then zero it out. That rather severely limits the throughput...

OK, no comments about madvise(MADV_POPULATE_READ), so I'm sticking to
touching pages via pointers.

> > If not then maybe add CHECKS_FOR_INTERRUPTS() there?
>
> Should definitely be there.

Added.
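
For clarity, the check sits in the loop that walks the OS pages backing
shared buffers, roughly like the sketch below (variable names follow the
patch, the exact loop shape is simplified):

#include "postgres.h"
#include "miscadmin.h"          /* CHECK_FOR_INTERRUPTS() */

/* Sketch: keep the long page-touching walk interruptible. */
static void
touch_os_pages(char **os_page_ptrs, size_t os_page_count)
{
    for (size_t i = 0; i < os_page_count; i++)
    {
        CHECK_FOR_INTERRUPTS();

        if (os_page_ptrs[i] != NULL)
        {
            volatile uint64 touch pg_attribute_unused();

            /* fault the page in so the later NUMA inquiry doesn't see -2 */
            touch = *(volatile uint64 *) os_page_ptrs[i];
        }
    }
}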

> > BTW I've tried additional MAP_POPULATE for PG_MMAP_FLAGS, but that didn't
> > help (it probably only works for parent//postmaster).
>
> Yes, needs to be in postmaster.
>
> Does the issue with "new" backends seeing pages as not present exist both with
> and without huge pages?

Please see attached file for more verbose results, but in short it is
like below:

patch(-touchpages)               hugepages=off  INVALID RESULTS (-2)
patch(-touchpages)               hugepages=on   INVALID RESULTS (-2)
patch(touchpages)                hugepages=off  CORRECT RESULT
patch(touchpages)                hugepages=on   CORRECT RESULT
patch(-touchpages)+MAP_POPULATE  hugepages=off  INVALID RESULTS (-2)
patch(-touchpages)+MAP_POPULATE  hugepages=on   INVALID RESULTS (-2)

IMHO, the only other thing that could work here (while still
page-faulting) is madvise(MADV_POPULATE_READ) on kernels 5.14+. Tests are
welcome; it might be kernel-version dependent.
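
For the record, the madvise() variant would look roughly like the sketch
below; it is untested here and not what the attached patches do
(MADV_POPULATE_READ requires Linux 5.14+ and headers that define the flag):

#include <sys/mman.h>
#include <errno.h>

/*
 * Sketch: ask the kernel to pre-fault a shared memory range for reading,
 * instead of dereferencing every page from userspace.  Returns 0 on
 * success, otherwise an errno value.
 */
static int
prefault_for_read(void *addr, size_t len)
{
#ifdef MADV_POPULATE_READ
    if (madvise(addr, len, MADV_POPULATE_READ) != 0)
        return errno;           /* EINVAL on kernels older than 5.14 */
    return 0;
#else
    (void) addr;
    (void) len;
    return ENOTSUP;             /* headers too old to know the flag */
#endif
}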

BTW: yes, you can "feel" the timing impact of MAP_SHARED|MAP_POPULATE
during startup, and it apparently doesn't help our case anyway: child
backends do not come up with the pre-faulted page mappings across fork().

> FWIW, what you posted fails on CI:
> https://cirrus-ci.com/task/5114213770723328
>
> Probably some ifdefs are missing. The sanity-check task configures with
> minimal dependencies, which is why you're seeing this even on linux.

Hopefully fixed; we'll see what cfbot says. I'm flying blind with all
of this CI stuff...

> > b. refactor shared code so that it goes into src/port (but with
> > Linux-only support so far)

Done.
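
The split is basically a couple of thin wrappers around the libnuma calls
we need, compiled out without --with-libnuma. A sketch follows; the
function names here are illustrative only, the real ones are in the
attached 0001:

#include "c.h"

#ifdef USE_LIBNUMA
#include <numa.h>
#include <numaif.h>

int
pg_numa_init(void)
{
    /* numa_available() must be called before any other libnuma function */
    return numa_available();
}

int
pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
{
    /* nodes == NULL means "report the node each page currently lives on" */
    return (int) numa_move_pages(pid, count, pages, NULL, status, 0);
}

#else                           /* !USE_LIBNUMA */

int
pg_numa_init(void)
{
    return -1;                  /* report NUMA support as unavailable */
}

int
pg_numa_query_pages(int pid, unsigned long count, void **pages, int *status)
{
    return -1;
}

#endif                          /* USE_LIBNUMA */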

> > c. should we use MemoryContext in pg_get_shmem_numa_allocations or not?
>
> You mean a specific context instead of CurrentMemoryContext?

Yes, I had doubts earlier, but for now I'm going to leave it as it is.

> > diff --git a/.cirrus.tasks.yml b/.cirrus.tasks.yml
> > index 91b51142d2e..e3b7554d9e8 100644
> > --- a/.cirrus.tasks.yml
> > +++ b/.cirrus.tasks.yml
> > @@ -436,6 +436,10 @@ task:
> > SANITIZER_FLAGS: -fsanitize=address
> > PG_TEST_PG_COMBINEBACKUP_MODE: --copy-file-range
> >
> > +
> > + # FIXME: use or not the libnuma?
> > + # --with-libnuma \
> > + #
> > # Normally, the "relation segment" code basically has no coverage in our
> > # tests, because we (quite reasonably) don't generate tables large
> > # enough in tests. We've had plenty bugs that we didn't notice due
> > the
>
> I don't see why not.

Fixed.

> > diff --git a/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> > new file mode 100644
> > index 00000000000..e5b3d1f7dd2
> > --- /dev/null
> > +++ b/contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql
> > @@ -0,0 +1,30 @@
> > +/* contrib/pg_buffercache/pg_buffercache--1.5--1.6.sql */
> > +
> > +-- complain if script is sourced in psql, rather than via CREATE EXTENSION
> > +\echo Use "CREATE EXTENSION pg_buffercache" to load this file. \quit
> > +
> > +-- Register the function.
> > +DROP FUNCTION pg_buffercache_pages() CASCADE;
>
> Why? I think that's going to cause problems, as the pg_buffercache view
> depends on it, and user views might in turn depend on pg_buffercache. I think
> CASCADE is rarely, if ever, ok to use in an extension script.

... it's just me cutting corners :^), fixed now.

[..]
> > +-- Don't want these to be available to public.
> > +REVOKE ALL ON FUNCTION pg_buffercache_pages(boolean) FROM PUBLIC;
> > +REVOKE ALL ON pg_buffercache FROM PUBLIC;
> > +REVOKE ALL ON pg_buffercache_numa FROM PUBLIC;
>
> We grant pg_monitor SELECT TO pg_buffercache, I think we should do the same
> for _numa?

Yup, fixed.

> > @@ -177,8 +228,61 @@ pg_buffercache_pages(PG_FUNCTION_ARGS)
> > else
> > fctx->record[i].isvalid = false;
> >
> > +#ifdef USE_LIBNUMA
> > +/* FIXME: taken from bufmgr.c, maybe move to .h ? */
> > +#define BufHdrGetBlock(bufHdr) ((Block) (BufferBlocks + ((Size) (bufHdr)->buf_id) * BLCKSZ))
> > + blk2page = (int) i * pages_per_blk;
>
> BufferGetBlock() is public, so I don't think BufHdrGetBlock() is needed here.

Fixed, thanks, I was looking for something like this! Is that +1 in v4 good?
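
For reviewers, the change boils down to the line below; the +1 is (I
believe) needed because Buffer numbers are 1-based while the loop index
i is 0-based:

/* buffer index i (0-based) -> Buffer (1-based) -> block address */
os_page_ptrs[blk2page + j] =
    (char *) BufferGetBlock(i + 1) + os_page_size * j;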

> > + j = 0;
> > + do {
> > + /*
> > + * Many buffers can point to the same page, but we want to
> > + * query just first address.
> > + *
> > + * In order to get reliable results we also need to touch memory pages
> > + * so that inquiry about NUMA zone doesn't return -2.
> > + */
> > + if(os_page_ptrs[blk2page+j] == 0) {
> > + volatile uint64 touch pg_attribute_unused();
> > + os_page_ptrs[blk2page+j] = (char *)BufHdrGetBlock(bufHdr) + (os_page_size*j);
> > + touch = *(uint64 *)os_page_ptrs[blk2page+j];
> > + }
> > + j++;
> > + } while(j < (int)pages_per_blk);
> > +#endif
> > +
>
> Why is this done before we even have gotten -2 back? Even if we need it, it
> seems like we ought to defer this until necessary.

Not fixed yet: maybe we could even add a `static` flag like
`has_this_run_earlier` and perform this work only once, the first time
through?
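
Something along these lines, i.e. a sketch of the idea rather than actual
patch code (the flag name is made up):

/* Sketch: only pay the page-touching cost on the first call per backend. */
static bool pages_already_touched = false;

if (!pages_already_touched)
{
    for (size_t i = 0; i < os_page_count; i++)
    {
        volatile uint64 touch pg_attribute_unused();

        CHECK_FOR_INTERRUPTS();
        touch = *(volatile uint64 *) os_page_ptrs[i];
    }
    pages_already_touched = true;
}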

> > +#ifdef USE_LIBNUMA
> > + if(query_numa) {
> > + /* According to numa(3) it is required to initialize library even if that's no-op. */
> > + if(numa_available() == -1) {
> > + pg_buffercache_mark_numa_invalid(fctx, NBuffers);
> > + elog(NOTICE, "libnuma initialization failed, some NUMA data might be unavailable.");;
> > + } else {
> > + /* Amortize the number of pages we need to query about */
> > + if(numa_move_pages(0, os_page_count, os_page_ptrs, NULL, os_pages_status, 0) == -1) {
> > + elog(ERROR, "failed NUMA pages inquiry status");
> > + }
>
> I wonder if we ought to override numa_error() so we can display more useful
> errors.

Another question without an easy answer: I never hit this error from
numa_move_pages(); one gets invalid entries in *os_pages_status instead.
BUT: most of this patch only uses libnuma calls that cannot fail, as far
as libnuma usage goes. One way to trigger libnuma warnings is e.g.
`chmod 700 /sys` (because it's hard to unmount it); most numactl
functionality still works then as euid != 0, but `numactl --hardware`
at least prints "libnuma: Warning: Cannot parse distance information in
sysfs: Permission denied", and it's the same story with `numactl -C 678
date`. So unless we start using libnuma much more heavily (not just for
observability), there's little point in that right now (?). That said,
we could just do a variadic elog() for it; I've put in some code, but
no idea if it works fine...
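
For reference, what I tried is roughly the following (numa(3) documents
numa_error() and numa_warn() as overridable by the application; whether
ERROR vs WARNING is the right severity here is up for debate):

#ifdef USE_LIBNUMA

#include "postgres.h"

#include <stdarg.h>
#include <numa.h>

/*
 * libnuma invokes these instead of printing to stderr when something goes
 * wrong; overriding them routes the messages through ereport().
 */
void
numa_error(char *where)
{
    ereport(ERROR,
            (errmsg("libnuma: error in %s", where)));
}

void
numa_warn(int num, char *fmt,...)
{
    char        buf[1024];
    va_list     ap;

    va_start(ap, fmt);
    vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);

    ereport(WARNING,
            (errmsg_internal("libnuma(%d): %s", num, buf)));
}

#endif                          /* USE_LIBNUMA */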

[..]
> Doing multiple memory allocations while holding an lwlock is probably not a
> great idea, even if the lock normally isn't contended.
[..]
> Why do this in very loop iteration?

Both fixed.

-J.

Attachment Content-Type Size
touchpages_vs_numa_inquiry_results.txt text/plain 5.4 KB
v4-0003-Add-pg_shmem_numa_allocations.patch application/octet-stream 6.4 KB
v4-0002-Extend-pg_buffercache-with-new-view-pg_buffercach.patch application/octet-stream 12.3 KB
v4-0001-Add-optional-dependency-to-libnuma-for-basic-NUMA.patch application/octet-stream 9.5 KB
