Re: PostgreSQL 14.4 ERROR: out of memory issues

From: Pavel Stehule <pavel(dot)stehule(at)gmail(dot)com>
To: Aleš Zelený <zeleny(dot)ales(at)gmail(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, "pgsql-generallists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: PostgreSQL 14.4 ERROR: out of memory issues
Date: 2022-07-14 19:39:31
Message-ID: CAFj8pRBFudan1P2aP+RwoUrxVBD6D2-Os963BFxjqae00nscAg@mail.gmail.com
Lists: pgsql-general

On Thu, Jul 14, 2022 at 21:26 Aleš Zelený <zeleny(dot)ales(at)gmail(dot)com> wrote:

>
> On Wed, Jul 13, 2022 at 2:20 Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
>> On Mon, Jul 11, 2022 at 10:50:23AM +0200, Aleš Zelený wrote:
>> > So far, it has happened three times (during a single week) since the
>> > 14.3 -> 14.4 upgrade; before 14.4 we had not suffered from such an issue.
>> >
>> > Questions:
>> > 1) Can we safely downgrade from 14.4 to 14.3 by shutting down the
>> > instance and reinstalling the 14.3 PG packages (to prove that the issue
>> > disappears)?
>> > 2) What is the best way to diagnose the root cause?
>>
>> Hmm. 14.4 has nothing in its release notes that would point to a
>> change in the vacuum or autovacuum's code paths:
>> https://www.postgresql.org/docs/14/release-14-4.html#id-1.11.6.5.4
>>
>> There is nothing specific after a look at the changes, and I am not
>> seeing anything that would imply a change in memory context handling
>> either:
>> `git log --stat REL_14_3..REL_14_4`
>> `git diff REL_14_3..REL_14_4 -- *.c`
>>
>> That said, you should be able to downgrade safely, as there are no
>> changes in the WAL format or the like that would break things. Note,
>> however, that the corruption issue caused by CONCURRENTLY is something
>> you'd still have to face.
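
For reference, a rough sketch of such a downgrade, assuming the PGDG RPM
packages on an RHEL-like system (package and service names are assumptions,
adjust them to your packaging):

    # rough sketch only: stop the cluster, swap the server packages back to
    # 14.3, start again; the data directory itself is left untouched
    systemctl stop postgresql-14
    dnf downgrade postgresql14-server-14.3 postgresql14-14.3 \
        postgresql14-libs-14.3 postgresql14-contrib-14.3
    systemctl start postgresql-14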
>>
>>
> Thanks, good to know; we can use it for a test case, since we have already
> hit the CONCURRENTLY bug on 14.3.
>
>> > 2022-07-02 14:48:07 CEST [3930]: [3-1] user=,db=,host=,app= ERROR: out of memory
>> > 2022-07-02 14:48:07 CEST [3930]: [4-1] user=,db=,host=,app= DETAIL: Failed on request of size 152094068 in memory context "TopTransactionContext".
>> > 2022-07-02 14:48:07 CEST [3930]: [5-1] user=,db=,host=,app= CONTEXT: automatic vacuum of table "prematch.replication.tab_queue_tmp"
>>
>> This is the interesting part. Do you happen to use logical
>> replication in a custom C++ plugin?
>>
>
> We are using logical replication to other instances (pgoutput) and
> decoderbufs (https://github.com/debezium/postgres-decoderbufs) for other
> applications.
>
>
>> > 2022-07-02 14:48:47 CEST [4476]: [43-1] user=,db=,host=,app= LOG: could not fork worker process: Cannot allocate memory
>> > terminate called after throwing an instance of 'std::bad_alloc'
>> >   what(): std::bad_alloc
>> >
>> > DETAIL: parameters: $1 = '1', $2 = '1748010445', $3 = '0', $4 = '1000'
>> > terminate called after throwing an instance of 'std::bad_alloc' terminate called after throwing an instance of 'std::bad_alloc' what(): what(): std::bad_allocstd::bad_alloc
>> > 2022-07-08 14:54:23 CEST [4476]: [49-1] user=,db=,host=,app= LOG: background worker "parallel worker" (PID 25251) was terminated by signal 6: Aborted
>> > 2022-07-08 14:54:23 CEST [4476]: [51-1] user=,db=,host=,app= LOG: terminating any other active server processes
>>
>> Looks like something is going wrong in the memory handling of one of
>> your C++ extensions here. If you can isolate the issue using a query
>> without any custom code, that would be a Postgres problem, but I suspect
>> the culprit is in the custom code.
>>
>
> Here are extensions installed in the database served by the cluster:
>
> prematch=# \dx
>                                      List of installed extensions
>         Name        | Version |   Schema   |                       Description
> --------------------+---------+------------+------------------------------------------------------------------------
>  amcheck            | 1.3     | public     | functions for verifying relation integrity
>  dblink             | 1.2     | public     | connect to other PostgreSQL databases from within a database
>  file_fdw           | 1.0     | public     | foreign-data wrapper for flat file access
>  hstore             | 1.8     | public     | data type for storing sets of (key, value) pairs
>  hypopg             | 1.3.1   | public     | Hypothetical indexes for PostgreSQL
>  pageinspect        | 1.9     | public     | inspect the contents of database pages at a low level
>  pg_buffercache     | 1.3     | public     | examine the shared buffer cache
>  pg_stat_kcache     | 2.2.0   | public     | Kernel statistics gathering
>  pg_stat_statements | 1.9     | public     | track planning and execution statistics of all SQL statements executed
>  pgcrypto           | 1.3     | public     | cryptographic functions
>  pgstattuple        | 1.5     | public     | show tuple-level statistics
>  plpgsql            | 1.0     | pg_catalog | PL/pgSQL procedural language
>  plpgsql_check      | 2.1     | public     | extended check for plpgsql functions
>  postgres_fdw       | 1.1     | public     | foreign-data wrapper for remote PostgreSQL servers
>  tablefunc          | 1.0     | public     | functions that manipulate whole tables, including crosstab
> (15 rows)
>
> When we started experiencing these issues, I began collecting
> /proc/<pid>/status for all PG instance processes in an infinite loop,
> guided by the Prometheus node-exporter statistics.
> From this I can extract some data: RssAnon is the memory I can see
> growing, from several hundred kB to 1-2 GB over some period (it depends
> on the workload, but usually 24 hours).
> The sessions whose RssAnon memory grows belong to PHP client
> applications; with multiple connections they might consume 10 GB of
> memory. It is the same application as before the upgrade (I've asked
> about deployed changes, but none was released close to the upgrade time).
> Comparing with the Prometheus server data, RssAnon seems to directly
> follow the node_memory_Active_anon_bytes node-exporter metric. If the
> "suspected" sessions disconnect, node_memory_Active_anon_bytes drops
> back down.
> I've checked the node_memory_Active_anon_bytes metric before the
> upgrade: with the same application workload it was pretty stable on PG
> 13.8 and later on 14.3, but immediately after the upgrade
> node_memory_Active_anon_bytes started to grow, and once the commit limit
> was reached, the fork()/out-of-memory issues appeared.
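
For what it's worth, a minimal collection loop of the kind you describe could
look roughly like this (the output file, the 10-second interval and the
"postgres" OS user are assumptions, adjust as needed):

    # sample a few RSS fields for every postgres process every 10 seconds;
    # one output line per sample: <unix_ts> <pid> <field> <value in kB>
    while true; do
        ts=$(date +%s)
        for pid in $(pgrep -u postgres); do
            awk -v ts="$ts" -v pid="$pid" \
                '/^(VmRSS|RssAnon|RssFile|RssShmem):/ {print ts, pid, $1, $2}' \
                "/proc/$pid/status" 2>/dev/null
        done >> /var/tmp/pg_proc_status.log
        sleep 10
    done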
>
> So for the application user running the PHP code, I've set
> log_min_duration_statement = 0, but no unexpected statements or arguments
> (some functions take an int array argument) were found in the logfile.
>
> Checking RssAnon from /proc/<pid>/status, I've found some points where
> RssAnon memory usage grew very steeply within a minute, but no
> "suspicious" queries/arguments were found in the instance logfile.
>
> Any hint on how to find the root cause would be appreciated, since so far
> I've failed to isolate the issue into a reproducible test case.
> At least I hope that watching the RssAnon process memory is an appropriate
> metric; if not, let me know and I'll try to update the monitoring to find
> the root cause.
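
One thing that might help here: PostgreSQL 14 can dump the memory context
statistics of a running backend into the server log, which should show which
context is actually holding the memory. A sketch, assuming superuser access
and with 12345 standing in for the PID of a backend whose RssAnon is growing:

    # ask the suspected backend to log its memory context statistics;
    # the output appears in the server log
    psql -U postgres -d prematch -c "SELECT pg_log_backend_memory_contexts(12345);"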
>
> I can imagine a workaround where the client application reconnects
> regularly..., but up to 14.3 it worked, so I'd like to fix the issue
> either on our application side or on the PG side if it is a PG problem.
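
If you do need a stopgap before the root cause is found, it does not have to
involve the application at all; a cron job could simply close idle application
backends past a certain age so that the clients reconnect. Only a sketch,
assuming the PHP connections run as a role named app_user (a placeholder) and
that dropping idle connections is acceptable for the application:

    # terminate idle backends of the application role older than 12 hours;
    # the client then opens a fresh connection and the grown memory is freed
    psql -U postgres -d prematch -c "
      SELECT pg_terminate_backend(pid)
        FROM pg_stat_activity
       WHERE usename = 'app_user'
         AND state = 'idle'
         AND backend_start < now() - interval '12 hours';"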
>

Theoretically, there can be a memory leak in libpq; there were some changes there.

Regards

Pavel

> Kind regards Ales Zeleny
>
> --
>> Michael
>>
>
