Re: Draft for basic NUMA observability

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
Cc: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Draft for basic NUMA observability
Date: 2025-04-03 11:52:04
Message-ID: fcae9008-a312-4aef-9252-196f1891cdb9@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 4/3/25 09:01, Jakub Wartak wrote:
> On Wed, Apr 2, 2025 at 6:40 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
>
> Hi Tomas,
>
>> OK, so you agree the commit messages are complete / correct?
>
> Yes.
>
>> OK. FWIW if you disagree with some of my proposed changes, feel free to
>> push back. I'm sure some may be more a matter of personal preference.
>
> No, it's all fine. I will probably have lots of questions about
> setting proper env for development that cares itself about style, but
> that's for another day.
>
>> [..floats..]
>> Hmmm, OK. Maybe it's correct. I still find the float arithmetic really
>> confusing and difficult to reason about ...
>>
>> I agree we don't want special cases for each possible combination of
>> page sizes (I'm not sure we even know all the combinations). What I was
>> thinking about is two branches, one for (block >= page) and another for
>> (block < page). AFAICK both values have to be 2^k, so this would
>> guarantee we have either (block/page) or (page/block) as integer.
>>
>> I wonder if you could even just calculate both, and have one loop that
>> deals with both.
>>
> [..]
>> When I say "integer arithmetic" I don't mean it should use 32-bit ints,
>> or any other data type. I mean that it works with non-floating point
>> values. It could be int64, Size or whatever is large enough to not
>> overflow. I really don't see how changing stuff to double makes this
>> easier to understand.
>
> I hear you, attached v21 / 0003 is free of float/double arithmetics
> and uses non-float point values. It should be more readable too with
> those comments. I have not put it into its own function, because now
> it fits the whole screen, so hopefully one can follow visually. Please
> let me know if that code solves the doubts or feel free to reformat
> it. That _numa_prepare_ptrs() is unused and will need to be removed,
> but we can still move some code there if necessary.
>

IMHO the code in v21 is much easier to understand. It's not quite clear
to me why it's done outside pg_buffercache_numa_prepare_ptrs(), though.

>>> 12) You have also raised "why not pg_shm_allocations_numa" instead of
>>> "pg_shm_numa_allocations"
>>>
>>> OPEN_QUESTION: To be honest, I'm not attached to any of those two (or
>>> naming things in general), I can change if you want.
>>>
>>
>> Me neither. I wonder if there's some precedent when adding similar
>> variants for other catalogs ... can you check? I've been thinking about
>> pg_stats and pg_stats_ext, but maybe there's a better example?
>
> Hm, it seems we always go with suffix "_somethingnew":
>
> * pg_stat_database -> pg_stat_database_conflicts
> * pg_stat_subscription -> pg_stat_subscription_stats
> * even here: pg_buffercache -> pg_buffercache_numa
>
> @Bertrand: do you have anything against pg_shm_allocations_numa
> instead of pg_shm_numa_allocations? I don't mind changing it...
>

+1 to pg_shmem_allocations_numa

>>> 13) In the patch: "review: What if we get multiple pages per buffer
>>> (the default). Could we get multiple nodes per buffer?"
>>>
>>> OPEN_QUESTION: Today no, but if we would modify pg_buffercache_numa to
>>> output multiple rows per single buffer (with "page_no") then we could
>>> get this:
>>> buffer1:..:page0:numanodeID1
>>> buffer1:..:page1:numanodeID2
>>> buffer2:..:page0:numanodeID1
>>>
>>> Should we add such functionality?
>>
>> When you say "today no" does that mean we know all pages will be on the
>> same node, or that there may be pages from different nodes and we can't
>> display that? That'd not be great, IMHO.
>>
>> I'm not a huge fan of returning multiple rows per buffer, with one row
>> per page. So for 8K blocks and 4K pages we'd have 2 rows per page. The
>> rest of the fields is for the whole buffer, it'd be wrong to duplicate
>> that for each page.
>
> OPEN_QUESTION: With v21 we have all the information available, we are
> just unable to display this in pg_buffercache_numa right now. We could
> trim the view so that it has 3 columns (and user needs to JOIN to
> pg_buffercache for more details like relationoid), but then what the
> whole refactor (0002) was for if we would just return bufferId like
> below:
>
> buffer1:page0:numanodeID1
> buffer1:page1:numanodeID2
> buffer2:page0:numanodeID1
> buffer2:page1:numanodeID1
>
> There's also the problem that reading/joining could be inconsistent
> and even slower.
>

I think a view with just 3 columns would be a good solution. It's what
pg_shmem_allocations_numa already does, so it'd be consistent with that
part too.

I'm not too worried about the cost of the extra join - it's going to be
a couple dozen milliseconds at worst, I guess, and that's negligible in
the bigger scheme of things (e.g. compared to how long the move_pages is
expected to take). Also, it's not like having everything in the same
view is free - people would have to do some sort of post-processing, and
that has a cost too.

So unless someone can demonstrate a use case where this would matter,
I'd not worry about it too much.

>> I wonder if we should have a bitmap of nodes for the buffer (but then
>> what if there are multiple pages from the same node?), or maybe just an
>> array of nodes, with one element per page.
>
> AFAIR this has been discussed back in end of January, and the
> conclusion was more or less - on Discord - that everything sucks
> (bitmaps, BIT() datatype, arrays,...) either from implementation or
> user side, but apparently arrays [] would suck the least from
> implementation side. So we could probably do something like up to
> node_max_nodes():
> buffer1:..:{0, 2, 0, 0}
> buffer2:..:{0, 1, 0, 1} #edgecase: buffer across 2 NUMA nodes
> buffer3:..:{0, 0, 0, 2}
>
> Other idea is JSON or even simple string with numa_node_id<->count:
> buffer1:..:"1=2"
> buffer2:..:"1=1 3=1" #edgecase: buffer across 2 NUMA nodes
> buffer3:..:"3=2"
>
> I find all of those non-user friendly and I'm afraid I won't be able
> to pull that alone in time...

I'm -1 on JSON, I don't see how would that solve anything better than
e.g. a regular array, and it's going to be harder to work with. So if we
don't want to go with the 3-column view proposed earlier, I'd stick to a
simple array. I don't think there's a huge difference between those two
approaches, it should be easy to convert between those approaches using
unnest() and array_agg().

Attached is v22, with some minor review comments:

1) I suggested we should just use "libnuma support" in configure,
instead of talking about "NUMA awareness support", and AFAICS you
agreed. But I still see the old text in configure ... is that
intentional or a bit of forgotten text?

2) I added a couple asserts to pg_buffercache_numa_pages() and comments,
and simplified a couple lines (but that's a matter of preference).

3) I don't think it's correct for pg_get_shmem_numa_allocations to just
silently ignore nodes outside the valid range. I suggest we simply do
elog(ERROR), as it's an internal error we don't expect to happen.

regards

--
Tomas Vondra

Attachment Content-Type Size
v22-0001-Add-support-for-basic-NUMA-awareness.patch text/x-patch 22.1 KB
v22-0002-review.patch text/x-patch 2.0 KB
v22-0003-pg_buffercache-split-pg_buffercache_pages-into-p.patch text/x-patch 12.8 KB
v22-0004-Add-pg_buffercache_numa-view-with-NUMA-node-info.patch text/x-patch 19.5 KB
v22-0005-review.patch text/x-patch 3.0 KB
v22-0006-Add-new-pg_shmem_numa_allocations-view.patch text/x-patch 16.3 KB
v22-0007-review.patch text/x-patch 886 bytes

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Ranier Vilela 2025-04-03 11:55:09 Re: libpq support for NegotiateProtocolVersion
Previous Message Fujii Masao 2025-04-03 11:46:59 Re: in BeginCopyTo make materialized view using COPY TO instead of COPY (query).