Re: Unexplained rapid growth in memory usage of idle backends

From: J T <jorge(dot)torralba(at)gmail(dot)com>
To: Alex Balashov <abalashov(at)evaristesys(dot)com>
Cc: pgsql-admin <pgsql-admin(at)lists(dot)postgresql(dot)org>
Subject: Re: Unexplained rapid growth in memory usage of idle backends
Date: 2023-05-13 15:09:34
Message-ID: CACut7uQd11zrHaWD_d8yDUxv7U_=tUR5+6POFHKFbF17EfU-Uw@mail.gmail.com
Lists: pgsql-admin

What is old_snapshot_threshold set to?
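
For reference, a quick way to check from psql:

    SHOW old_snapshot_threshold;
    -- or, with a bit more detail:
    SELECT name, setting, unit FROM pg_settings WHERE name = 'old_snapshot_threshold';

A value of -1 (the default) means the feature is disabled entirely.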

We had an incident where runaway, stacked-up sessions left the system
practically useless: transactions were dropping and session counts kept
climbing. We used the system profiling tool "perf" together with top to
trace the bottleneck down to spinlocks; the sheer volume of spinlock
contention was bringing the system to its knees. The perf output also showed
the "TransactionIdLimitedForOldSnapshot" function as a notable consumer of
CPU, which brought to mind a known issue with the old_snapshot_threshold
parameter, which had a non-default value on that system. There have been
reports in the Postgres community of old_snapshot_threshold bringing a
system down completely under load. Have a look at that.
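
If you want to reproduce that kind of check, something along these lines
works (substitute a real backend PID and adjust the sample window to taste):

    perf top -g                                  # live view; spinlock time shows up under s_lock
    perf record -g -p <backend_pid> -- sleep 30  # sample one suspect backend for 30 seconds
    perf report --sort symbol                    # look for s_lock and TransactionIdLimitedForOldSnapshot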

Also, this is a really old issue but worth mentioning anyway: what are
Transparent Huge Pages set to at the kernel level? I disable THP regardless
of whether the old problems have been resolved, since it was a DB killer at
one point.
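
To check the current setting and disable it on the fly (add
transparent_hugepage=never to the kernel boot line to make it permanent):

    cat /sys/kernel/mm/transparent_hugepage/enabled
    echo never > /sys/kernel/mm/transparent_hugepage/enabled   # effective immediately, until reboot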

JT

On Fri, May 12, 2023 at 3:12 PM Alex Balashov <abalashov(at)evaristesys(dot)com>
wrote:

> Hi,
>
> We have an application that has a preforked worker pool architecture and
> opens a relatively large (70+) number of persistent connections to
> PostgreSQL 14, and have `max_connections` set to 200 accordingly.
>
> This has worked well enough in many deployments, and the backends
> corresponding to these connections mostly sit idle. Occasionally they are
> hit with a query, and those workloads are almost entirely PL/PgSQL stored
> functions. These functions mostly just aggregate output from various
> tables, and occasionally apply some business logic; no temporary tables, no
> runaway recursive CTEs, nothing baroque.
>
> Occasionally, about once every 2 months, we'll get a slow-motion implosion
> over a period of about 24 hours, where the resident memory size of some of
> the backends shoots up from a reasonable few hundred MB to several gigs --
> and will keep growing, e.g.
>
>     PID USER      PR  NI    VIRT    RES    SHR S %CPU %MEM     TIME+ COMMAND
> 3932733 postgres  20   0  637740 327124 323840 R 10.7  0.5   0:00.11 postgres
>  782930 postgres  20   0 3151360   2.8g 551704 S  9.7  4.5 221:43.17 postgres
>  782839 postgres  20   0 2999896   2.7g 551832 S  8.7  4.3 220:54.55 postgres
> 3932734 postgres  20   0  636128 338168 336096 R  8.7  0.5   0:00.09 postgres
> 3932735 postgres  20   0  636128 319824 317768 R  8.7  0.5   0:00.09 postgres
>  782851 postgres  20   0 3142152   2.8g 551872 S  7.8  4.5 221:22.62 postgres
>  782855 postgres  20   0 3155144   2.8g 551828 S  7.8  4.5 221:38.59 postgres
>  782897 postgres  20   0 3148808   2.8g 551724 S  7.8  4.5 220:05.94 postgres
>  783106 postgres  20   0 3152416   2.8g 551912 S  7.8  4.5 221:24.40 postgres
>  783168 postgres  20   0 2992592   2.7g 551968 S  7.8  4.2 220:36.06 postgres
>  782843 postgres  20   0 3146772   2.8g 551944 R  4.9  4.5 221:38.46 postgres
>  782778 postgres  20   0 3150956   2.8g 551852 R  3.9  4.5 220:49.88 postgres
>
> It would be only natural to expect a cornucopia of heavyweight queries
> there, but there aren't any. `pg_stat_activity` just reveals an active
> WalSender (for streaming replication) and the `pg_stat_activity` query
> (WHERE state <> 'idle') itself. Once in a while, I'll catch a shortlived
> query in _one_ of these, if I run it often enough.
>
> The state of affairs deteriorates until either
>
> 1) the grim OOM reaper comes:
>
> [19063737.017400] Out of memory: Killed process 536356 (postgres)
> total-vm:21703068kB, anon-rss:20804636kB, file-rss:4kB, shmem-rss:534896kB,
> UID:107 pgtables:42384kB oom_score_adj:0
> [19063739.149563] oom_reaper: reaped process 536356 (postgres), now
> anon-rss:0kB, file-rss:0kB, shmem-rss:534896kB,
>
> or
>
> 2) the client application is restarted, closing the persistent connections
> and terminating their corresponding backends. This releases the memory back
> to the OS and all is well again.
>
> For reference:
>
> ----- postgresql.auto.conf -----
> listen_addresses = '*'
> timezone = 'UTC'
> max_connections = '200'
> effective_cache_size = '2GB'
> work_mem = '128MB'
> wal_level = 'replica'
> fsync = 'on'
> synchronous_commit = 'off'
> checkpoint_completion_target = '0.75'
> checkpoint_warning = '15min'
> autovacuum = 'on'
> autovacuum_freeze_max_age = '200000000'
> vacuum_freeze_min_age = '10000'
> vacuum_freeze_table_age = '1000000000'
> primary_conninfo = 'user=replication passfile=''/root/.pgpass''
> channel_binding=prefer host=xxx.xxx.xxx.xxx port=5432'
> promote_trigger_file = '/var/lib/postgresql/14/main/failover_trig'
> primary_slot_name = 'replication_db5'
> log_min_duration_statement = '-1'
> max_worker_processes = '4'
> max_parallel_workers_per_gather = '4'
> shared_buffers = '512MB'
> --------------------------------
>
> And:
>
> ----- free -----
>                total        used        free      shared  buff/cache   available
> Mem:        65837856     2623820    47176960      567988    16037076    61989160
> Swap:        1000444           0     1000444
> ----------------
>
> Lastly, I will reiterate that there is no evidence of a slow-brewing
> memory leak. Memory usage seems more or less steady-state in general,
> rising and falling in expected ways with peak and off-peak workloads. Then,
> suddenly, some of the backends go into the aforementioned nosedive.
>
> I have considered using a connection pooler to limit the overall memory
> footprint and blast radius, and while this might address the problem from
> an operational point of view, it does not really resolve the essential
> question: why is this happening in the first place, seemingly out of
> nowhere?
>
> I will also say that while the client application does call a lot of
> stored functions, they are all rather circumscribed in scope, in support of
> real-time routing decisions. These are not expensive reporting queries as
> might be issued from an API or a user interface of some kind, for example.
> The client IPs on the problematic backends above correspond to the client
> application, not to any other connected clients.
>
> I'm at an utter loss as to how to troubleshoot or prevent this. Any
> insight would be deeply appreciated!
>
> -- Alex
>
> --
> Alex Balashov
> Principal Consultant
> Evariste Systems LLC
> Web: https://evaristesys.com
> Tel: +1-706-510-6800
>
>
>
>

--
Thanks,

Jorge Torralba
----------------------------

Note: This communication may contain privileged or other confidential
information. If you are not the intended recipient, please do not print,
copy, retransmit, disseminate or otherwise use the information. Please
indicate to the sender that you have received this email in error and
delete the copy you received. Thank You.
