From: Andres Freund <andres(at)anarazel(dot)de>
To: Rahila Syed <rahilasyed90(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Improve monitoring of shared memory allocations
Date: 2025-03-03 18:07:47
Message-ID: k6f6ynjvb7lvebhygaiqsfrdohq672uughk3q4ve4q5jqljywz@7jbrz724epeq
Lists: pgsql-hackers
Hi,
Thanks for sending these; the issues addressed here have been bugging me for a
long while.
On 2025-03-01 10:19:01 +0530, Rahila Syed wrote:
> The 0001* patch improved the accounting for the shared memory allocated for
> a hash table during hash_create. pg_shmem_allocations tracks the memory
> allocated by ShmemInitStruct, which, for shared hash tables, only covers
> memory allocated for the hash directory and control structure via
> ShmemInitHash. The hash segments and buckets are allocated using
> ShmemAllocNoError, which does not attribute the allocation to the hash table
> and also does not add it to ShmemIndex.
>
> Therefore, these allocations are not tracked in pg_shmem_allocations. To
> improve this, include the allocation of segments and buckets in the initial
> allocation of the shared memory for the hash table, in ShmemInitHash.
>
> This will result in pg_shmem_allocations representing the total size of the
> initial hash table, including all the buckets and elements, instead of just
> the directory size.
I think this should also be more efficient. Less space wasted on padding and
fewer indirect function calls.
> Like ShmemAllocNoError, the shared memory allocated by ShmemAlloc is not
> tracked by pg_shmem_allocations. The 0002* patch replaces most of the calls
> to ShmemAlloc with ShmemInitStruct to associate a name with the allocations
> and ensure that they get tracked by pg_shmem_allocations.
In some of these cases it may be better to combine the allocation with a prior
ShmemInitStruct instead of doing a separate ShmemInitStruct() for the
allocations that are doing ShmemAlloc() right now.
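I.e. something like this pattern (untested sketch; FooStruct, FooElement and
"Foo Data" are made-up names, just to illustrate):

    Size        size;
    bool        found;
    FooStruct  *ctl;

    /* one allocation, and thus one pg_shmem_allocations entry, for both */
    size = add_size(MAXALIGN(sizeof(FooStruct)),
                    mul_size(nelems, sizeof(FooElement)));
    ctl = (FooStruct *) ShmemInitStruct("Foo Data", size, &found);

    /* carve the element array out of the tail of the same allocation */
    ctl->elements = (FooElement *) ((char *) ctl + MAXALIGN(sizeof(FooStruct)));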
cfbot found a few compiler warnings:
https://cirrus-ci.com/task/6526903542087680
[16:47:46.964] make -s -j${BUILD_JOBS} clean
[16:47:47.452] time make -s -j${BUILD_JOBS} world-bin
[16:49:10.496] lwlock.c: In function ‘CreateLWLocks’:
[16:49:10.496] lwlock.c:467:22: error: unused variable ‘found’ [-Werror=unused-variable]
[16:49:10.496] 467 | bool found;
[16:49:10.496] | ^~~~~
[16:49:10.496] cc1: all warnings being treated as errors
[16:49:10.496] make[4]: *** [<builtin>: lwlock.o] Error 1
[16:49:10.496] make[3]: *** [../../../src/backend/common.mk:37: lmgr-recursive] Error 2
[16:49:10.496] make[3]: *** Waiting for unfinished jobs....
[16:49:11.881] make[2]: *** [common.mk:37: storage-recursive] Error 2
[16:49:11.881] make[2]: *** Waiting for unfinished jobs....
[16:49:20.195] dynahash.c: In function ‘hash_create’:
[16:49:20.195] dynahash.c:643:37: error: ‘curr_offset’ may be used uninitialized [-Werror=maybe-uninitialized]
[16:49:20.195] 643 | curr_offset = (((char *)curr_offset) + (temp * elementSize));
[16:49:20.195] | ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
[16:49:20.195] dynahash.c:588:23: note: ‘curr_offset’ was declared here
[16:49:20.195] 588 | void *curr_offset;
[16:49:20.195] | ^~~~~~~~~~~
[16:49:20.195] cc1: all warnings being treated as errors
[16:49:20.196] make[4]: *** [<builtin>: dynahash.o] Error 1
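Both look trivial to silence - dropping (or actually checking) the unused
"found" in CreateLWLocks(), and initializing the offset in hash_create(),
something like (untested):

    void       *curr_offset = NULL; /* only set & used for shared tables */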
> From c13a7133ed455842b685426217a7b5079e6fc869 Mon Sep 17 00:00:00 2001
> From: Rahila Syed <rahilasyed(dot)90(at)gmail(dot)com>
> Date: Fri, 21 Feb 2025 15:08:12 +0530
> Subject: [PATCH 1/2] Account for initial shared memory allocated during
> hash_create.
>
> pg_shmem_allocations tracks the memory allocated by ShmemInitStruct,
> which, in case of shared hash tables, only covers memory allocated
> to the hash directory and control structure. The hash segments and
> buckets are allocated using ShmemAllocNoError which does not attribute
> the allocations to the hash table name. Thus, these allocations are
> not tracked in pg_shmem_allocations.
>
> Improve this by including the allocation of segments and buckets or elements
> in the initial allocation of the shared hash directory. Since this adds numbers
> to existing hash table entries, the resulting tuples in pg_shmem_allocations
> represent the total size of the initial hash table including all the
> buckets and the elements they contain, instead of just the directory size.
> diff --git a/src/backend/utils/hash/dynahash.c b/src/backend/utils/hash/dynahash.c
> index cd5a00132f..5203f5b30b 100644
> --- a/src/backend/utils/hash/dynahash.c
> +++ b/src/backend/utils/hash/dynahash.c
> @@ -120,7 +120,6 @@
> * a good idea of the maximum number of entries!). For non-shared hash
> * tables, the initial directory size can be left at the default.
> */
> -#define DEF_SEGSIZE 256
> #define DEF_SEGSIZE_SHIFT 8 /* must be log2(DEF_SEGSIZE) */
> #define DEF_DIRSIZE 256
Why did you move this to the header? Afaict it's just used in
hash_get_shared_size(), which is also in dynahash.c?
> @@ -265,7 +264,7 @@ static long hash_accesses,
> */
> static void *DynaHashAlloc(Size size);
> static HASHSEGMENT seg_alloc(HTAB *hashp);
> -static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx);
> +static bool element_alloc(HTAB *hashp, int nelem, int freelist_idx, HASHELEMENT *firstElement);
> static bool dir_realloc(HTAB *hashp);
> static bool expand_table(HTAB *hashp);
> static HASHBUCKET get_hash_entry(HTAB *hashp, int freelist_idx);
> @@ -276,11 +275,10 @@ static void hash_corrupted(HTAB *hashp) pg_attribute_noreturn();
> static uint32 hash_initial_lookup(HTAB *hashp, uint32 hashvalue,
> HASHBUCKET **bucketptr);
> static long next_pow2_long(long num);
> -static int next_pow2_int(long num);
> static void register_seq_scan(HTAB *hashp);
> static void deregister_seq_scan(HTAB *hashp);
> static bool has_seq_scans(HTAB *hashp);
> -
> +static int find_num_of_segs(long nelem, int *nbuckets, long num_partitions, long ssize);
>
> /*
> * memory allocation support
You removed a newline here that probably shouldn't be removed.
> @@ -468,7 +466,11 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
> else
> hashp->keycopy = memcpy;
>
> - /* And select the entry allocation function, too. */
> + /*
> + * And select the entry allocation function, too. XXX should this also
> + * Assert that flags & HASH_SHARED_MEM is true, since HASH_ALLOC is
> + * currently only set with HASH_SHARED_MEM *
> + */
> if (flags & HASH_ALLOC)
> hashp->alloc = info->alloc;
> else
> @@ -518,6 +520,7 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
>
> hashp->frozen = false;
>
> + /* Initializing the HASHHDR variables with default values */
> hdefault(hashp);
>
> hctl = hashp->hctl;
I assume these were just observations you made while looking into this? They
seem unrelated to the change itself?
> @@ -582,7 +585,8 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
> freelist_partitions,
> nelem_alloc,
> nelem_alloc_first;
> -
> + void *curr_offset;
> +
> /*
> * If hash table is partitioned, give each freelist an equal share of
> * the initial allocation. Otherwise only freeList[0] is used.
> @@ -592,6 +596,20 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
> else
> freelist_partitions = 1;
>
> + /*
> + * If table is shared, calculate the offset at which to find the
> + * the first partition of elements
> + */
> + if (hashp->isshared)
> + {
> + int nsegs;
> + int nbuckets;
> + nsegs = find_num_of_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
> +
> + curr_offset = (((char *) hashp->hctl) + sizeof(HASHHDR) + (info->dsize * sizeof(HASHSEGMENT)) +
> + + (sizeof(HASHBUCKET) * hctl->ssize * nsegs));
> + }
> +
Why only do this for shared hashtables? Couldn't we allocate the elements
together with the rest for non-shared hashtables too?
> nelem_alloc = nelem / freelist_partitions;
> if (nelem_alloc <= 0)
> nelem_alloc = 1;
> @@ -609,11 +627,20 @@ hash_create(const char *tabname, long nelem, const HASHCTL *info, int flags)
> for (i = 0; i < freelist_partitions; i++)
> {
> int temp = (i == 0) ? nelem_alloc_first : nelem_alloc;
> -
> - if (!element_alloc(hashp, temp, i))
> + HASHELEMENT *firstElement;
> + Size elementSize = MAXALIGN(sizeof(HASHELEMENT)) + MAXALIGN(hctl->entrysize);
> +
> + /* Memory is allocated as part of initial allocation in ShmemInitHash */
> + if (hashp->isshared)
> + firstElement = (HASHELEMENT *) curr_offset;
> + else
> + firstElement = NULL;
> +
> + if (!element_alloc(hashp, temp, i, firstElement))
> ereport(ERROR,
> (errcode(ERRCODE_OUT_OF_MEMORY),
> errmsg("out of memory")));
> + curr_offset = (((char *)curr_offset) + (temp * elementSize));
> }
> }
Seems a bit ugly to go through element_alloc() when pre-allocating. Perhaps
it's the best thing we can do to avoid duplicating code, but it seems worth
checking if we can do better. Perhaps we could split element_alloc() into
element_alloc() and element_add() or such? With the latter doing everything
after hashp->alloc().
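Something like this split, perhaps (untested sketch, derived from the current
element_alloc(); the pre-allocated path in hash_create() would then call
element_add() alone):

    /* allocate space for some new elements; returns NULL on failure */
    static HASHELEMENT *
    element_alloc(HTAB *hashp, int nelem)
    {
        Size        elementSize;

        if (hashp->isfixed)
            return NULL;

        /* each element has a HASHELEMENT header plus user data */
        elementSize = MAXALIGN(sizeof(HASHELEMENT)) +
            MAXALIGN(hashp->hctl->entrysize);

        CurrentDynaHashCxt = hashp->hcxt;
        return (HASHELEMENT *) hashp->alloc(nelem * elementSize);
    }

    /* link nelem elements, wherever they came from, into freelist_idx */
    static void
    element_add(HTAB *hashp, HASHELEMENT *firstElement, int nelem,
                int freelist_idx)
    {
        HASHHDR    *hctl = hashp->hctl;
        Size        elementSize;
        HASHELEMENT *tmpElement;
        HASHELEMENT *prevElement;
        int         i;

        elementSize = MAXALIGN(sizeof(HASHELEMENT)) +
            MAXALIGN(hctl->entrysize);

        /* chain the new entries together, firstElement ending up the tail */
        prevElement = NULL;
        tmpElement = firstElement;
        for (i = 0; i < nelem; i++)
        {
            tmpElement->link = prevElement;
            prevElement = tmpElement;
            tmpElement = (HASHELEMENT *) (((char *) tmpElement) + elementSize);
        }

        /* if partitioned, must lock to touch the freelist */
        if (IS_PARTITIONED(hctl))
            SpinLockAcquire(&hctl->freeList[freelist_idx].mutex);

        /* splice the chain onto the (possibly nonempty) freelist */
        firstElement->link = hctl->freeList[freelist_idx].freeList;
        hctl->freeList[freelist_idx].freeList = prevElement;

        if (IS_PARTITIONED(hctl))
            SpinLockRelease(&hctl->freeList[freelist_idx].mutex);
    }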
> @@ -693,7 +720,7 @@ init_htab(HTAB *hashp, long nelem)
> int nbuckets;
> int nsegs;
> int i;
> -
> +
> /*
> * initialize mutexes if it's a partitioned table
> */
Spurious change.
> @@ -701,30 +728,11 @@ init_htab(HTAB *hashp, long nelem)
> for (i = 0; i < NUM_FREELISTS; i++)
> SpinLockInit(&(hctl->freeList[i].mutex));
>
> - /*
> - * Allocate space for the next greater power of two number of buckets,
> - * assuming a desired maximum load factor of 1.
> - */
> - nbuckets = next_pow2_int(nelem);
> -
> - /*
> - * In a partitioned table, nbuckets must be at least equal to
> - * num_partitions; were it less, keys with apparently different partition
> - * numbers would map to the same bucket, breaking partition independence.
> - * (Normally nbuckets will be much bigger; this is just a safety check.)
> - */
> - while (nbuckets < hctl->num_partitions)
> - nbuckets <<= 1;
> + nsegs = find_num_of_segs(nelem, &nbuckets, hctl->num_partitions, hctl->ssize);
>
> hctl->max_bucket = hctl->low_mask = nbuckets - 1;
> hctl->high_mask = (nbuckets << 1) - 1;
>
> - /*
> - * Figure number of directory segments needed, round up to a power of 2
> - */
> - nsegs = (nbuckets - 1) / hctl->ssize + 1;
> - nsegs = next_pow2_int(nsegs);
> -
> /*
> * Make sure directory is big enough. If pre-allocated directory is too
> * small, choke (caller screwed up).
A function called find_num_of_segs() that also sets nbuckets seems a bit
confusing. I also don't like "find_*", as that sounds like it's searching
some data structure, rather than just doing a bit of math.
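Maybe something like compute_buckets_and_segs(nelem, num_partitions, ssize,
&nbuckets)? Just a naming suggestion, to make it obvious it's pure math.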
> @@ -1689,6 +1726,7 @@ seg_alloc(HTAB *hashp)
> HASHSEGMENT segp;
>
> CurrentDynaHashCxt = hashp->hcxt;
> +
> segp = (HASHSEGMENT) hashp->alloc(sizeof(HASHBUCKET) * hashp->ssize);
>
> if (!segp)
Spurious change.
> @@ -1816,7 +1854,7 @@ next_pow2_long(long num)
> }
>
> /* calculate first power of 2 >= num, bounded to what will fit in an int */
> -static int
> +int
> next_pow2_int(long num)
> {
> if (num > INT_MAX / 2)
> @@ -1957,3 +1995,31 @@ AtEOSubXact_HashTables(bool isCommit, int nestDepth)
> }
> }
> }
Why export this?
> diff --git a/src/include/utils/hsearch.h b/src/include/utils/hsearch.h
> index 932cc4f34d..5e16bd4183 100644
> --- a/src/include/utils/hsearch.h
> +++ b/src/include/utils/hsearch.h
> @@ -151,7 +151,7 @@ extern void hash_seq_term(HASH_SEQ_STATUS *status);
> extern void hash_freeze(HTAB *hashp);
> extern Size hash_estimate_size(long num_entries, Size entrysize);
> extern long hash_select_dirsize(long num_entries);
> -extern Size hash_get_shared_size(HASHCTL *info, int flags);
> +extern Size hash_get_shared_size(HASHCTL *info, int flags, long init_size);
> extern void AtEOXact_HashTables(bool isCommit);
> extern void AtEOSubXact_HashTables(bool isCommit, int nestDepth);
It's imo a bit weird that we have very related logic in hash_estimate_size()
and hash_get_shared_size(). Why do we need to duplicate it?
> --- a/src/backend/storage/lmgr/predicate.c
> +++ b/src/backend/storage/lmgr/predicate.c
> @@ -1244,7 +1244,7 @@ PredicateLockShmemInit(void)
> PredXact->HavePartialClearedThrough = 0;
> requestSize = mul_size((Size) max_table_size,
> sizeof(SERIALIZABLEXACT));
> - PredXact->element = ShmemAlloc(requestSize);
> + PredXact->element = ShmemInitStruct("SerializableXactList", requestSize, &found);
> /* Add all elements to available list, clean. */
> memset(PredXact->element, 0, requestSize);
> for (i = 0; i < max_table_size; i++)
> @@ -1299,9 +1299,11 @@ PredicateLockShmemInit(void)
> * probably OK.
> */
> max_table_size *= 5;
> + requestSize = mul_size((Size) max_table_size,
> + RWConflictDataSize);
>
> RWConflictPool = ShmemInitStruct("RWConflictPool",
> - RWConflictPoolHeaderDataSize,
> + RWConflictPoolHeaderDataSize + requestSize,
> &found);
> Assert(found == IsUnderPostmaster);
> if (!found)
> @@ -1309,9 +1311,7 @@ PredicateLockShmemInit(void)
> int i;
>
> dlist_init(&RWConflictPool->availableList);
> - requestSize = mul_size((Size) max_table_size,
> - RWConflictDataSize);
> - RWConflictPool->element = ShmemAlloc(requestSize);
> + RWConflictPool->element = (RWConflict) ((char *)RWConflictPool + RWConflictPoolHeaderDataSize);
> /* Add all elements to available list, clean. */
> memset(RWConflictPool->element, 0, requestSize);
> for (i = 0; i < max_table_size; i++)
These I'd just combine with the ShmemInitStruct("PredXactList"), by allocating
the additional space. The pointer math is a bit annoying, but it makes much
more sense to have one entry in pg_shmem_allocations.
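I.e. roughly this, for the PredXact case (untested):

    requestSize = mul_size((Size) max_table_size,
                           sizeof(SERIALIZABLEXACT));
    PredXact = ShmemInitStruct("PredXactList",
                               PredXactListDataSize + requestSize,
                               &found);

    /* element array lives right after the header, same shmem entry */
    PredXact->element = (SERIALIZABLEXACT *)
        ((char *) PredXact + PredXactListDataSize);
    memset(PredXact->element, 0, requestSize);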
> diff --git a/src/backend/storage/lmgr/proc.c b/src/backend/storage/lmgr/proc.c
> index 49204f91a2..e112735d93 100644
> --- a/src/backend/storage/lmgr/proc.c
> +++ b/src/backend/storage/lmgr/proc.c
> @@ -218,11 +218,11 @@ InitProcGlobal(void)
> * how hotly they are accessed.
> */
> ProcGlobal->xids =
> - (TransactionId *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->xids));
> + (TransactionId *) ShmemInitStruct("Proc Transaction Ids", TotalProcs * sizeof(*ProcGlobal->xids), &found);
> MemSet(ProcGlobal->xids, 0, TotalProcs * sizeof(*ProcGlobal->xids));
> - ProcGlobal->subxidStates = (XidCacheStatus *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->subxidStates));
> + ProcGlobal->subxidStates = (XidCacheStatus *) ShmemInitStruct("Proc Sub-transaction id states", TotalProcs * sizeof(*ProcGlobal->subxidStates), &found);
> MemSet(ProcGlobal->subxidStates, 0, TotalProcs * sizeof(*ProcGlobal->subxidStates));
> - ProcGlobal->statusFlags = (uint8 *) ShmemAlloc(TotalProcs * sizeof(*ProcGlobal->statusFlags));
> + ProcGlobal->statusFlags = (uint8 *) ShmemInitStruct("Proc Status Flags", TotalProcs * sizeof(*ProcGlobal->statusFlags), &found);
> MemSet(ProcGlobal->statusFlags, 0, TotalProcs * sizeof(*ProcGlobal->statusFlags));
>
> /*
Same.
Although here I'd say it's worth padding the size of each separate
"allocation" by PG_CACHE_LINE_SIZE.
> @@ -233,7 +233,7 @@ InitProcGlobal(void)
> fpLockBitsSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(uint64));
> fpRelIdSize = MAXALIGN(FastPathLockGroupsPerBackend * sizeof(Oid) * FP_LOCK_SLOTS_PER_GROUP);
>
> - fpPtr = ShmemAlloc(TotalProcs * (fpLockBitsSize + fpRelIdSize));
> + fpPtr = ShmemInitStruct("Fast path lock arrays", TotalProcs * (fpLockBitsSize + fpRelIdSize), &found);
> MemSet(fpPtr, 0, TotalProcs * (fpLockBitsSize + fpRelIdSize));
>
> /* For asserts checking we did not overflow. */
This one might actually make sense to keep separate, depending on the
configuration it can be reasonably big (max_connections = 1k,
max_locks_per_transaction = 1k results in ~5MB).
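(If I have the math right: max_locks_per_transaction = 1k works out to
1024/16 = 64 fast-path groups per backend, i.e. 64 * 8 bytes of lock bits
plus 64 * 16 * 4 bytes of relids, ~4.5kB per backend, times ~1k backends.)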
Greetings,
Andres Freund