From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Draft for basic NUMA observability
Date: 2025-04-07 15:51:29
Message-ID: y4zhgypa4vt3txf22yzvkfe2m4rgrph25ms6ax2ukduwcl43u3@dosysiprwsha
Lists: pgsql-hackers
Hi,
On 2025-04-06 13:56:54 +0200, Tomas Vondra wrote:
> On 4/6/25 01:00, Andres Freund wrote:
> > On 2025-04-05 18:29:22 -0400, Andres Freund wrote:
> >> I think one thing that the docs should mention is that calling the numa
> >> functions/views will force the pages to be allocated, even if they're
> >> currently unused.
> >>
> >> Newly started server, with s_b of 32GB and 2MB huge pages:
> >>
> >> grep ^Huge /proc/meminfo
> >> HugePages_Total: 34802
> >> HugePages_Free: 34448
> >> HugePages_Rsvd: 16437
> >> HugePages_Surp: 0
> >> Hugepagesize: 2048 kB
> >> Hugetlb: 76517376 kB
> >>
> >> run
> >> SELECT node_id, sum(size) FROM pg_shmem_allocations_numa GROUP BY node_id;
> >>
> >> Now the pages that previously were marked as reserved are actually allocated:
> >>
> >> grep ^Huge /proc/meminfo
> >> HugePages_Total: 34802
> >> HugePages_Free: 18012
> >> HugePages_Rsvd: 1
> >> HugePages_Surp: 0
> >> Hugepagesize: 2048 kB
> >> Hugetlb: 76517376 kB
> >>
> >>
> >> I don't see how we can avoid that right now, but at the very least we ought to
> >> document it.
> >
> > The only allocation where that really matters is shared_buffers. I wonder if
> > we could special case the logic for that, by only probing if at least one of
> > the buffers in the range is valid.
> >
> > Then we could treat a page status of -ENOENT as "page is not mapped" and
> > display NULL for the node_id?
> >
> > Of course that would mean that we'd always need to
> > pg_numa_touch_mem_if_required(), not just the first time round, because we
> > previously might not have touched a page that is now valid. But compared to
> > the cost of actually allocating pages, the cost for that seems small.
> >
>
> I don't think this would be a good trade off. The buffers already have a
> NUMA node, and users would be interested in that.
The thing is that the buffer might *NOT* have a NUMA node. That's the case in
the example above, for instance - otherwise we wouldn't initially have seen
the large HugePages_Rsvd.
Forcing all those pages to be allocated via pg_numa_touch_mem_if_required()
itself wouldn't be too bad - in fact I'd rather like to have an explicit way
of doing that. The problem is that it causes all those allocations to happen
on the *current* numa node (unless you have started postgres with
numactl --interleave=all or such), rather than the node where the normal first
use would have allocated it.
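
For illustration, the touch boils down to a plain volatile read that faults
the page in. A minimal sketch of that idea (touch_page is a made-up helper,
not the actual pg_numa_touch_mem_if_required() implementation), assuming
Linux's default first-touch placement:

    #include <stdint.h>

    /*
     * Fault a page in by reading from it.  Under Linux's default
     * first-touch policy the kernel backs the page with memory on the
     * NUMA node of the CPU this process currently runs on - which is
     * exactly the skew described above when a monitoring backend does
     * the touching.
     */
    static inline void
    touch_page(const void *ptr)
    {
        volatile uint64_t dummy;

        dummy = *(volatile const uint64_t *) ptr;
        (void) dummy;
    }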
> It's just that we don't have the memory mapped in the current backend, so
> I'd bet people would not be happy with NULL, and would proceed to force the
> allocation in some other way (say, a large query of some sort). Which
> obviously causes a lot of other problems.
I don't think that really would be the case with what I proposed? If any
buffer in the region were valid, we would force the allocation to become known
to the current backend.
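
To sketch what that probing could look like (report_page_nodes is a
hypothetical helper, not PostgreSQL code): move_pages(2) with a NULL nodes
argument only queries placement, without migrating or allocating anything,
and a status of -ENOENT indicates the page has no backing memory yet:

    #include <errno.h>
    #include <stdio.h>
    #include <numaif.h>     /* move_pages(); link with -lnuma */

    /* Report each page's NUMA node, mapping "not mapped" to NULL. */
    static void
    report_page_nodes(void **pages, unsigned long npages)
    {
        int status[npages];

        /* nodes == NULL: just ask where each page currently lives */
        if (move_pages(0, npages, pages, NULL, status, 0) < 0)
        {
            perror("move_pages");
            return;
        }

        for (unsigned long i = 0; i < npages; i++)
        {
            if (status[i] == -ENOENT)
                printf("page %lu: not mapped -> node_id NULL\n", i);
            else if (status[i] < 0)
                printf("page %lu: error %d\n", i, -status[i]);
            else
                printf("page %lu: node %d\n", i, status[i]);
        }
    }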
Greetings,
Andres Freund