NUMA shared memory interleaving

From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tomas Vondra <tomas(at)vondra(dot)me>, Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Subject: NUMA shared memory interleaving
Date: 2025-04-16 09:14:32
Message-ID: CAKZiRmw6i1W1AwXxa-Asrn8wrVcVH3TO715g_MCoowTS9rkGyw@mail.gmail.com
Lists: pgsql-hackers

Thanks to having pg_numa.c, we can now simply address problem #2 of
NUMA imbalance from [1], pages 11-14, by interleaving shm memory in
PG19 - patch attached. We do not need to call numa_set_localalloc(),
as we only interleave the shm segments while local allocations stay
the same ("local" here meaning relative to the CPU asking for private
memory).
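
For illustration, here is a minimal sketch of the mechanism using
libnuma directly (a simplified example of the idea only, not the
attached patch itself):

#include <stddef.h>
#include <numa.h>

/*
 * Sketch: interleave the pages of a mapped (but not yet faulted-in)
 * shared memory segment round-robin across all NUMA nodes.
 * Backend-private allocations are untouched, so they keep the default
 * "allocate on the local node" policy and numa_set_localalloc() is
 * not needed.
 */
static void
interleave_shm_segment(void *base, size_t size)
{
    if (numa_available() == -1)
        return;                 /* no NUMA support on this system */

    numa_interleave_memory(base, size, numa_all_nodes_ptr);
}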
Below are results from a legacy 4s32t64 Sandy Bridge EP box with low
NUMA (QPI) interconnect bandwidth, chosen to better illustrate the
problem (it's a bit of an edge case, but someone may hit it):

Testcase:
a small shared_buffers (here it was 4GB*) that fully fits within a
single NUMA node's hugepage zone, as this was tested with
huge_pages=on

$ cat seqconcurrscans.pgb
\set num (:client_id % 8) + 1
select sum(octet_length(filler)) from pgbench_accounts_:num;

/usr/local/pgsql/bin/pg_ctl -D /db/data -l logfile restart
# load all using the current policy
/usr/local/pgsql/bin/psql -c "select pg_prewarm('pgbench_accounts_'||s) from generate_series(1, 8) s;"
/usr/local/pgsql/bin/psql -c "select * from pg_shmem_allocations_numa where name = 'Buffer Blocks';"
/usr/local/pgsql/bin/pgbench -c 64 -j 8 -P 1 -T 60 -f seqconcurrscans.pgb

on master with numa=off (the default), and in previous versions:
name | numa_node | size
---------------+-----------+------------
Buffer Blocks | 0 | 0
Buffer Blocks | 1 | 0
Buffer Blocks | 2 | 4297064448
Buffer Blocks | 3 | 0

latency average = 1826.324 ms
latency stddev = 665.567 ms
tps = 34.708151 (without initial connection time)

on master with numa=on:
name | numa_node | size
---------------+-----------+------------
Buffer Blocks | 0 | 1073741824
Buffer Blocks | 1 | 1073741824
Buffer Blocks | 2 | 1075838976
Buffer Blocks | 3 | 1073741824

latency average = 1002.288 ms
latency stddev = 214.392 ms
tps = 63.344814 (without initial connection time)

Normal pgbench workloads tend not to be affected, as each backend
tends to touch just a small partition of shm (thanks to BAS
strategies). Some remaining questions are:
1. How to name this GUC (numa or numa_shm_interleave)? I prefer the
first option, as we could potentially add more optimizations behind
that GUC in the future.
2. Should we also interleave DSA/DSM for Parallel Query? (I'm not an
expert on DSA/DSM at all.)
3. Should we fail to start when numa=on is set on an unsupported
platform?

* An interesting tidbit for getting reliable measurements: one needs
to double-check that s_b (the hugepage allocation) is smaller than the
free hugepages of a single NUMA zone, i.e. that the static hugepage
allocation for s_b fits within one zone. This shouldn't be a problem
on 2 sockets (there s_b is usually < 50% of RAM anyway - typically
26-30% once max_connections-related overhead is added, which is a bit
more than the 25% of RAM people usually set via sysctl nr_hugepages),
but with >= 4 NUMA nodes (4 sockets or some modern MCMs) the kernel
might start spilling s_b (> 25%) to another NUMA node on its own, so
it's best to verify it using pg_shmem_allocations_numa...
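
For example, on Linux the per-node hugepage headroom can be checked
before startup like this (assuming the default 2MB hugepage size):

$ grep . /sys/devices/system/node/node*/hugepages/hugepages-2048kB/free_hugepages

For the non-interleaved baseline a single node needs enough free
hugepages to hold all of s_b; after startup, the
pg_shmem_allocations_numa query shown above shows where the buffers
actually ended up.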

-J.

[1] - https://anarazel.de/talks/2024-10-23-pgconf-eu-numa-vs-postgresql/numa-vs-postgresql.pdf

Attachment Content-Type Size
v1-0001-Add-capability-to-interleave-shared-memory-across.patch application/octet-stream 6.4 KB
