Re: Improve monitoring of shared memory allocations

From: Tomas Vondra <tomas(at)vondra(dot)me>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improve monitoring of shared memory allocations
Date: 2025-03-27 23:50:42
Message-ID: 6705dbd2-060b-4f3c-9fcb-1c7f10880b26@vondra.me
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 3/27/25 13:56, Tomas Vondra wrote:
> ...
>
> OK, I don't have any other comments for 0001 and 0002. I'll do some
> more review and polishing on those, and will get them committed soon.
>

Actually ... while polishing 0001 and 0002, I noticed a couple more
details that I'd like to ask about. I also ran pgindent and tweaked the
formatting, so some of the diff is caused by that.

1) alignment

There was a comment with a question whether we need to MAXALIGN the
chunks in dynahash.c, which were originally allocated by ShmemAlloc, but
now it's part of one large allocation, which is then cut into pieces
(using pointer arithmetics).

I was not sure whether we need to enforce some alignment, we briefly
discussed that off-list. I realize you chose to add the alignment, but I
haven't noticed any comment in the patch why it's needed, and it seems
to me it may not be quite correct.

Let me explain what I had in mind, and why I think the way v5 doesn't
actually do that. It took me a while before I understood what alignment
is about, and for a while it was haunting my patches, so hopefully this
will help others ...

The "alignment" is about pointers (or addresses), and when a pointer is
aligned it means the address is a multiple of some number. For example
4B-aligned pointer is a multiple of 4B, so 0x00000100 is 4B-aligned,
while 0x00000101 is not. Sometimes we use data types to express the
alignment, e.g. int-aligned is 4B-aligned, but that's a detail. AFAIK
the alignment is always 2^k, so 1, 2, 4, 8, ...

The primary reason for alignment is that some architectures require the
pointers to be well-aligned for a given data type. For example (int*)
needs to be int-aligned. If you have a pointer that's not 4B-aligned,
it'll trigger SIGBUS or maybe SIGSEGV. This was true for architectures
like powerpc, I don't think x86/arm64 have this restriction, i.e. it'd
work, even if there might be a minor performance impact. Anyway, we
still enforce/expect correct alignment, because we may still support
some of those alignment-sensitive platforms, and it's "tidy".

The other reason is that we sometimes use alignment to add padding, to
reduce contention when accessing elements in hot arrays. We want to
align to cacheline boundaries, so that a struct does not require
accessing more cachelines than really necessary. And also to reduce
contention - the more cachelines, the higher the risk of contention.

Now, back to the patch. The code originally did this in ShmemInitStruct

hashp = ShmemInitStruct(...)

to allocate the hctl, and then

firstElement = (HASHELEMENT *) ShmemAlloc(nelem * elementSize);

in element_alloc(). But this means the "elements" allocation is aligned
to PG_CACHE_LINE_SIZE, i.e. 128B, because ShmemAllocRaw() does this:

size = CACHELINEALIGN(size);

So it distributes memory in multiples of 128B, and I believe it starts
at a multiple of 128B.

But the patch reworks this to allocate everything at once, and thus it
won't get this alignment automatically. AFAIK that's not intentional,
because no one explicitly mentioned this. And it's may not be quite
desirable, judging by the comment in ShmemAllocRaw().

I mentioned v5 adds alignment, but I think it does not quite do that
quite correctly. It adds alignment by changing the macros from:

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+ (sizeof(HASHHDR) + \
+ ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+ ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))

to

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+ (MAXALIGN(sizeof(HASHHDR)) + \
+ ((hctl)->dsize * MAXALIGN(sizeof(HASHSEGMENT))) + \
+ ((hctl)->ssize * (nsegs) * MAXALIGN(sizeof(HASHBUCKET))))

First, it uses MAXALIGN, but that's mostly my fault, because my comment
suggested that - the ShmemAllocRaw however and makes the case for using
CACHELINEALIGN.

But more importantly, it adds alignment to all hctl field, and to every
element of those arrays. But that's not what the alignment was supposed
to do - it was supposed to align arrays, not individual elements. Not
only would this waste memory, it would actually break direct access to
those array elements.

So something like this might be more correct:

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+ (MAXALIGN(sizeof(HASHHDR)) + \
+ MAXALIGN((hctl)->dsize * izeof(HASHSEGMENT)) + \
+ MAXALIGN((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))

But there's another detail - even before this patch, most of the stuff
was allocated at once by ShmemInitStruct(). Everything except for the
elements, so to replicate the alignment we only need to worry about that
last part. So I think this should do:

+#define HASH_ELEMENTS_OFFSET(hctl, nsegs) \
+ CACHELINEALIGN(sizeof(HASHHDR) + \
+ ((hctl)->dsize * sizeof(HASHSEGMENT)) + \
+ ((hctl)->ssize * (nsegs) * sizeof(HASHBUCKET)))

This is what the 0003 patch does. There's still one minor difference, in
that we used to align each segment independently - each element_alloc()
call allocated a new CACHELINEALIGN-ed chunk, while now have just a
single chunk. But I think that's OK.

In fact, I'm starting to wonder if we need to worry about this at all.
Maybe it's not that significant for dynahash after all - otherwise we
actually should align/pad the individual elements, not just the array as
a whole, I think.

2) I do think the last patch should use CACHELINEALIGN() too, instead of
adding PG_CACHE_LINE_SIZE to the sizes (which does not ensure good
alignment, it just means there's gap between the allocated parts). But I
still don't know what the intention was, so hard to say ...

3) I find the comment before hash_get_init_size a bit unclear/confusing.
It says this:

* init_size should match the total number of elements allocated during
* hash table creation, it could be zero for non-shared hash tables
* depending on the value of nelem_alloc. For more explanation see
* comments within this function.
*
* nelem_alloc parameter is not relevant for shared hash tables.

What does "should match" mean here? Doesn't it *determine* the number of
elements allocated? What if it doesn't match?

AFAICS it means the hash table is sized to expect init_size elements,
but only nelem_alloc elements are actually pre-allocated, right? But the
comment says it's init_size which determines the number of elements
allocated during creation. Confusing.

It says "it could be zero ... depending on the value of nelem_alloc".
Depending how? What's the relationship.

Then it says "nelem_alloc parameter is not relevant for shared hash
tables". What does "not relevant" mean? It should say it's ignored.

The bit "For more explanation see comments within this function" is not
great, if only because there are not many comments within the function,
so there's no "more explanation". But if there's something important, it
should be in the main comment, preferably.

regards

--
Tomas Vondra

Attachment Content-Type Size
v6-0001-Account-for-all-the-shared-memory-allocated-by-ha.patch text/x-patch 15.4 KB
v6-0002-review.patch text/x-patch 10.5 KB
v6-0003-align-using-CACHELINESIZE-to-match-ShmemAllocRaw.patch text/x-patch 1.3 KB
v6-0004-Replace-ShmemAlloc-calls-by-ShmemInitStruct.patch text/x-patch 7.2 KB
v6-0005-review.patch text/x-patch 4.6 KB
v6-0006-Add-cacheline-padding-between-heavily-accessed-ar.patch text/x-patch 2.0 KB
v6-0007-review.patch text/x-patch 2.3 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2025-03-27 23:57:34 Re: [PATCH] PGSERVICEFILE as part of a normal connection string
Previous Message Sadeq Dousti 2025-03-27 23:28:50 Re: psql \dh: List High-Level (Root) Tables and Indexes