From: | Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com> |
---|---|
To: | Tomas Vondra <tomas(at)vondra(dot)me> |
Cc: | Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Draft for basic NUMA observability |
Date: | 2025-04-02 14:46:53 |
Message-ID: | CAKZiRmypAJ8jsrNmEN+yDJ9CnJFA_g8KjTmnsexWq0sXsuvk=Q@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, Apr 1, 2025 at 10:17 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> Hi,
>
> I've spent a bit of time reviewing this. In general I haven't found
> anything I'd call a bug, but here's a couple comments for v18 ... Most
> of this is in separate "review" commits, with a couple exceptions.
Hi, thank you very much for help on this, yes I did not anticipate
this patch to organically grow like that...
I've squashed those review findings into v20 and provided answers for
the "review:".
> 1) Please update the commit messages, with proper formatting, etc. I
> tried to do that in the attached v19, but please go through that, add
> relevant details, update list of reviewers, etc. The subject should not
> be overly long, etc.
Fixed by you.
> 2) I don't think we need "libnuma for NUMA awareness" in configure, I'd
> use just "libnuma support" similar to other libraries.
Fixed by you.
> 3) I don't think we need pg_numa.h to have this:
>
> extern PGDLLIMPORT Datum pg_numa_available(PG_FUNCTION_ARGS);
>
> AFAICS we don't have any SQL functions exposed as PGDLLIMPORT, so why
> would it be necessary here? It's enough to have a prototype in .c file.
Right, probably the result of ENOTENOUGHCOFFEE and copy/paste.
> 4) Improved .sgml to have acronym/productname in a couple places.
Great.
> 5) I don't think the comment for pg_buffercache_init_entries() is very
> useful. That it's helper for pg_buffercache_pages() tells me nothing
> about how to use it, what the arguments are, etc.
I've added an explanation (in 0003 though), so that this is covered.
I've always assumed that 'static' functions don't need that much of
that(?)
> 6) IMHO pg_buffercache_numa_prepare_ptrs() would deserve a better
> comment too. I mean, there's no info about what the arguments are, which
> arguments are input or output, etc. And it only discussed one option
> (block page < memory page), but what happens in the other case? The
> formulas with blk2page/blk2pageoff are not quite clear to me (I'm not
> saying it's wrong).
>
> However, it seems rather suspicious that pages_per_blk is calculated as
> float, and pg_buffercache_numa_prepare_ptrs() then does this:
>
> for (size_t j = 0; j < pages_per_blk; j++)
> { ... }
>
> I mean, isn't this vulnerable to rounding errors, which might trigger
> some weird behavior? If not, it'd be good to add a comment why this is
> fine, it confuses me a lot. I personally would probably prefer doing
> just integer arithmetic here.
Please bear with me: If you set client_min_messages to debug1 and then
pg_buffercache_numa will dump:
a) without HP, DEBUG: NUMA: os_page_count=32768 os_page_size=4096
pages_per_blk=2.00
b) with HP (2M) DEBUG: NUMA: os_page_count=64 os_page_size=2097152
pages_per_blk=0.003906
so we need to be agile to support two cases as you mention (BLCKSZ >
PAGESIZE and BLCKSZ < PAGESIZE). BLCKSZ are 2..32kB and pagesize are
4kB..1GB, thus we can get in that float the following sample values:
BLCKSZ pagesize
2kB 4kB = 0.5
2kB 2048kb = .0009765625
2kB 1024*1024kb # 1GB = .0000019073486328125 # worst-case?
8kB 4kB = 2
8kB 2048kb = .003906250 # example from above (x86_64, 2M HP)
8kB 1024*1024kb # 1GB = .00000762939453
32kB 4kB = 8
32kB 2048kb = .0156250
32kB 1024*1024kb # 1GB = .000030517578125
So that loop:
for (size_t j = 0; j < pages_per_blk; j++)
is quite generic and launches in both cases. I've somehow failed to
somehow come up with integer-based math and generic code for this
(without special cases which seem to be no-go here?). So, that loop
then will:
a) launch many times to support BLCKSZ > pagesize, that is when single
DB block spans multiple memory pages
b) launch once when BLCKSZ < pagesize (because 0.003906250 > 0 in the
example above)
Loop touches && stores addresses into os_page_ptrs[] as input to this
one big move_pages(2) query. So we basically ask for all memory pages
for NBuffers. Once we get our NUMA information we then use blk2page =
up_to_NBuffers * pages_per_blk to resolve memory pointers back to
Buffers, if anywhere it could be a problem here.
So let's say we have s_b=4TB (it wont work for sure for other reasons,
let's assume we have it), let's also assume we have no huge
pages(pagesize=4kB) and BLCKSZ=8kB (default) => NBuffers=1073741824
which multiplied by 2 = INT_MAX (integer overflow bug), so I think
that int is not big enough there in pg_buffercache_numa_pages() (it
should be "size_t blk2page" there as in
pg_buffercache_numa_prepare_ptrs(), so I've changed it in v20)
Another angle is s_b=4TB RAM with 2MB HP, BLKSZ=8kB =>
NBuffers=2097152 * 0.003906250 = 8192.0 .
OPEN_QUESTION: I'm not sure all of this is safe and I'm seeking help, but with
float f = 2097152 * 0.003906250;
under clang -Weverything I got "implicit conversion increases
floating-point precision: 'float' to 'double'", so either it is:
- we somehow rewrite all of the core arithmetics here to integer?
- or simply go with doubles just to be sure? I went with doubles in
v20, comments explaining are not there yet.
> 7) This para in the docs seems self-contradictory:
>
> <para>
> The <function>pg_buffercache_numa_pages()</function> provides the same
> information
> as <function>pg_buffercache_pages()</function> but is slower because
> it also
> provides the <acronym>NUMA</acronym> node ID per shared buffer entry.
> The <structname>pg_buffercache_numa</structname> view wraps the
> function for
> convenient use.
> </para>
>
> I mean, "provides the same information, but is slower because it
> provides different information" is strange. I understand the point, but
> maybe rephrase somehow?
Oh my... yes, now it looks way better.
> 8) Why is pg_numa_available() marked as volatile? Should not change in a
> running cluster, no?
No it shouldn't, good find, made it 's'table.
> 9) I noticed the new SGML docs have IDs with mixed "-" and "_". Maybe
> let's not do that.
>
> <sect2 id="pgbuffercache-pg-buffercache_numa">
Fixed.
> 10) I think it'd be good to mention/explain why move_pages is used
> instead of get_mempolicy - more efficient with batching, etc. This would
> be useful both in the commit message and before the move_pages call
Ok, added in 0001.
> (and in general to explain why pg_buffercache_numa_prepare_ptrs prepares the
> pointers like this etc.).
Added reference to that earlier new comment here too in 0003.
> 11) This could use UINT64_FORMAT, instead of a cast:
>
> elog(DEBUG1, "NUMA: os_page_count=%lu os_page_size=%zu
> pages_per_blk=%.2f",
> (unsigned long) os_page_count, os_page_size, pages_per_blk);
Done.
12) You have also raised "why not pg_shm_allocations_numa" instead of
"pg_shm_numa_allocations"
OPEN_QUESTION: To be honest, I'm not attached to any of those two (or
naming things in general), I can change if you want.
13) In the patch: "review: What if we get multiple pages per buffer
(the default). Could we get multiple nodes per buffer?"
OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to
output multiple rows per single buffer (with "page_no") then we could
get this:
buffer1:..:page0:numanodeID1
buffer1:..:page1:numanodeID2
buffer2:..:page0:numanodeID1
Should we add such functionality?
-J.
Attachment | Content-Type | Size |
---|---|---|
v20-0001-Add-support-for-basic-NUMA-awareness.patch | application/x-patch | 22.1 KB |
v20-0004-Add-new-pg_shmem_numa_allocations-view.patch | application/x-patch | 16.3 KB |
v20-0002-pg_buffercache-split-pg_buffercache_pages-into-p.patch | application/x-patch | 12.8 KB |
v20-0003-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch | application/x-patch | 18.7 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Bertrand Drouvot | 2025-04-02 15:00:52 | Re: Fix 035_standby_logical_decoding.pl race conditions |
Previous Message | Jakub Wartak | 2025-04-02 14:45:53 | Re: Draft for basic NUMA observability |