From: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>
To: Tomas Vondra <tomas(at)vondra(dot)me>
Cc: shawn wang <shawn(dot)wang(dot)pg(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, David Rowley <dgrowleyml(at)gmail(dot)com>, Rafia Sabih <rafia(dot)pghackers(at)gmail(dot)com>, Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Trim the heap free memory
Date: 2025-01-21 09:16:03
Message-ID: CAKZiRmybrFynMTio_L44DdW5o3Uc5oTQx7-Om4AOvk4k78HSYw@mail.gmail.com
Lists: pgsql-hackers
On Sun, Dec 8, 2024 at 7:48 PM Tomas Vondra <tomas(at)vondra(dot)me> wrote:
[..]
> >> I have previously encountered situations where the non-garbage-collected
> >> memory of wal_sender was approximately hundreds of megabytes or even
> >> exceeded 1GB, but I was unable to reproduce this situation using simple
> >> SQL. Therefore, I introduced an asynchronous processing function, hoping
> >> to manage memory more efficiently without affecting performance.
> >>
> >
> > I doubt a system function is the right approach to deal with these
> > memory allocation issues. The function has to be called by the user,
> > which means the user is expected to monitor the system and decide when
> > to invoke the function. That seems far from trivial - it would require
> > collecting OS-level information about memory usage, and I suppose it'd
> > need to happen fairly often to actually help with OOM reliably.
[..]
> > Sure, forcing the system to release memory more aggressively may affect
> > performance - that's the tradeoff done by glibc. But calling the new
> > pg_trim_backend_heap_free_memory() function is not free either.
> >
> > But why would it force the memory to be returned immediately?
> > The decision whether to trim memory is driven by M_TRIM_THRESHOLD, and
> > that does not need to be 0. In fact, it's 128kB by default, i.e. glibc
> > trims memory automatically, if it can trim at least 128kB.
[..]
> To propose something less abstract / more tangible, I think we should do
> something like this:
>
> 1) add a bit of code for glibc-based systems, that adjusts selected
> malloc parameters using mallopt() during startup
>
> 2) add a GUC that enables this, with the default being the regular glibc
> behavior (with dynamic adjustment of various thresholds)
>
>
> Which exact parameters would this set is an open question, but based on
> my earlier experiments, Ronan's earlier patches, etc. I think it should
> adjust at least
>
> M_TRIM_THRESHOLD - to make sure we trim heap regularly
> M_TOP_PAD - to make sure we cache some allocated memory
>
> I wonder if maybe we should tune M_MMAP_THRESHOLD, which on 64-bit
> systems defaults to 32MB, so we don't really mmap() very often for
> regular memory contexts. But I don't know if that's a good idea, that
> would need some experiments.
>
> I believe that's essentially what Ronan Dunklau proposed, but it
> stalled. Not because of some inherent complexity, but because of
> concerns about introducing glibc-specific code.
>
> Based on my recent experiments I think it's clearly worth it (esp. with
> high concurrency workloads). If glibc was a niche, it'd be a different
> situation, but I'd guess vast majority of databases runs on glibc. Yes,
> it's possible to do these changes without new code (e.g. by setting the
> environment variables), but that's rather inconvenient.
>
> Perhaps it'd be possible to make it a bit smarter by looking at malloc
> stats, and adjust the trim/pad thresholds, but I'd leave that for the
> future. It might even lead to similar issues with excessive memory usage
> just like the logic built into glibc.
>
> But maybe we could at least print / provide some debugging information?
> That would help with adjusting the GUC ...
Hi all,
Thread bump. Just to add a single data point to this discussion: we
have been chasing some ghost memory leaks that apparently were not
memory leaks after all (they stop at a certain threshold, like 1.2GB),
but OOMs were still occurring. After some experimentation it turned
out that the memory had been used in MemoryContexts and was later
released (so outside of TopMemoryContext) when the session went
idle/idle in transaction, but the process was *still* holding on to
it. Injecting a call to `malloc_trim()` released backend memory for
sessions that had been idle for some time.
E.g. with PG 13.x I've got a more or less reliable reproducer (thanks
to my colleague Matthew Gwillam-Kelly, who did the initial
identification of the problem):
DROP TABLE p;
CREATE TABLE p (
id int not null,
sensor_id bigint not null,
val bigint
) PARTITION BY HASH (sensor_id);
CREATE INDEX p_idx ON P (val);
SELECT 'CREATE TABLE p_'||g||' PARTITION OF p FOR VALUES WITH (MODULUS
1000, REMAINDER ' || g || ');' FROM generate_series(0, 999) g;
\gexec
INSERT INTO p SELECT g, g, g FROM generate_series(1, 1000000) g;
ANALYZE p;
Run `UPDATE p SET val = val;` a minimum of 3 or 4 times in a new
session; the backend will use (in my case) ~400MB and stay (!) like
that indefinitely:
$ grep ^Pss /proc/27421/smaps_rollup
Pss: 399291 kB
Pss_Dirty: 397351 kB
Pss_Anon: 353859 kB
Pss_File: 1939 kB
Pss_Shmem: 43492 kB
After injecting call to malloc_trim(0) it shows much lower Pss_Anon:
$ grep ^Pss /proc/27421/smaps_rollup
Pss: 65904 kB
Pss_Dirty: 64189 kB
Pss_Anon: 23231 kB
Pss_File: 1715 kB
Pss_Shmem: 40957 kB
NOTE: it does not depend on the (maintenance_)work_mem variables; it
depends more on the PG version involved, extensions, probably
encoding, partition count, maybe triggers. That's ~353MB wasted above
(our customer was hitting it in the ~1.2 GB range, but they had
additional extensions loaded which could further amplify the effect):
memory fully allocated yet unused by the memory contexts (the pfree()
calls succeeded, free() did nothing, it's just that the memory is not
returned to the OS). So before the trim it looks like this:
TopMemoryContext: 801664 total in 29 blocks; 498048 free (2033
chunks); 303616 used
[..]
Grand total: 22213784 bytes in 3129 blocks; 9674384 free (3393
chunks); 12539400 used
A single such UPDATE produces the following frequency histogram of
request sizes passed to malloc():
@:
[1] 1 | |
[2, 4) 43 | |
[4, 8) 81 | |
[8, 16) 261 |@ |
[16, 32) 10049 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32, 64) 8951 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[64, 128) 446 |@@ |
[128, 256) 133 | |
[256, 512) 118 | |
[512, 1K) 11 | |
[1K, 2K) 134 | |
[2K, 4K) 5 | |
[4K, 8K) 94 | |
[8K, 16K) 1020 |@@@@@ |
[16K, 32K) 4122 |@@@@@@@@@@@@@@@@@@@@@ |
[32K, 64K) 29 | |
[64K, 128K) 14 | |
[128K, 256K) 2196 |@@@@@@@@@@@ |
[256K, 512K) 2 | |
[..]
E.g. one of the hot paths for this (remember, it's still PG13) is
heap_update->RelationGetBufferForTuple->GetPageWithFreeSpace->fsm_search->fsm_readbuf->mdopenfork->mdopenfork->PathNameOpenFile->PathNameOpenFilePerm->__GI___strdup.
Here it's strdup(), but it could be anything, and that's the point.
This glibc effect is completely reproducible, please see attached: any
pattern of small allocations (<= 120 bytes) ends up not releasing
memory back to the OS.
$ gcc mwr.c -o mwr -DMALLOC_SIZE=120 && ./mwr
done
Rss: 1251460 kB
Pss: 1250136 kB
Pss_Dirty: 1250112 kB
Pss_Anon: 1250100 kB
Pss_File: 36 kB
Pss_Shmem: 0 kB
after malloc_trim:
Rss: 1460 kB
Pss: 136 kB
Pss_Dirty: 100 kB
Pss_Anon: 100 kB
Pss_File: 36 kB
Pss_Shmem: 0 kB
$ gcc mwr.c -o mwr -DMALLOC_SIZE=121 && ./mwr # 120+8 >= 128
done
Rss: 1676 kB
Pss: 259 kB
Pss_Dirty: 224 kB
Pss_Anon: 224 kB
Pss_File: 35 kB
Pss_Shmem: 0 kB
after malloc_trim:
Rss: 1548 kB
Pss: 131 kB
Pss_Dirty: 96 kB
Pss_Anon: 96 kB
Pss_File: 35 kB
Pss_Shmem: 0 kB
Now, the current PG18 behaved much better in that regard, without that
many small mallocs during runtime (strdup() is still there, it's just
that the hot path is not exercised as often):
@:
[8, 16) 2697 |@@@@@@@ |
[16, 32) 2203 |@@@@@@ |
[32, 64) 0 | |
[64, 128) 0 | |
[128, 256) 0 | |
[256, 512) 0 | |
[512, 1K) 0 | |
[1K, 2K) 3014 |@@@@@@@@ |
[2K, 4K) 5 | |
[4K, 8K) 2 | |
[8K, 16K) 1107 |@@@ |
[16K, 32K) 18112 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[32K, 64K) 12 | |
[..]
Yet I could still drop Pss_Anon from ~44MB to ~13MB by calling
malloc_trim(0). Assume 1k idle connections like this and you
theoretically end up wasting ~30GB of RAM.
So basically we have two generic solutions to this class of problem,
to avoid OOMs due to GNU libc's malloc() not releasing memory:
0. Disconnecting the backend (I'm not counting it, as it doesn't seem
to be a solid long-term solution, but it explains why people push for
poolers with refreshable connection pools).
1. Call malloc_trim(0), but Tom stated it might not be portable, so
maybe there is a chance for an extension or #ifdefs. I do think that
calling it after every query might not be ideal due to the overhead,
but perhaps once a query is done we could schedule an interrupt for
now()+X seconds (where X >= 5?), so it executes only when the backend
has gone really inactive (to avoid re-allocating the memory again),
and abort it if the next query has already started. I haven't looked
at the code, so I don't know whether that can be done cheaply.
2. Or use GLIBC_TUNABLES; e.g. disabling mxfast bin allocations shows
some promise, even with many small allocations:
$ gcc mwr.c -o mwr -DMALLOC_SIZE=120 &&
GLIBC_TUNABLES=glibc.malloc.mxfast=0 ./mwr
done
Rss: 1680 kB
Pss: 257 kB
Pss_Dirty: 236 kB # no need for malloc_trim()
Pss_Anon: 224 kB
Pss_File: 33 kB
Pss_Shmem: 0 kB
[..]
From my side also -1 to the idea of exposing the
pg_trim_backend_heap_free_memory() function as per the original patch
proposal: how is the user supposed to embed this within their
application?
I have not quantified the overhead for #1 and #2.
-J.
Attachment: mwr.c (text/plain, 704 bytes)