Diagnose memory leak in logical replication?

From: Raphaël Enrici <raphael(dot)enrici(at)yadle(dot)net>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Diagnose memory leak in logical replication?
Date: 2020-11-05 16:51:31
Message-ID: 086c8efd4338c40ec51f312bf160a824@yadle.net
Lists: pgsql-general

Dear all,

Hope everyone is doing well.

While playing with wal2json and Debezium we ran into an issue where
walsender processes consume more and more memory (several hundred
megabytes per hour) until we restart them by stopping their consumers.

We first suspected something wrong in wal2json[1], but after more
testing today we see exactly the same behaviour with the pgoutput plugin.

The environment is Debian buster with packages from apt.postgresql.org
(currently using 12.2-2.pgdg100+1):

postgresql-12 - 12.2-2.pgdg100+1
postgresql-12-wal2json - 2.2-1.pgdg100+1
postgresql-client-12 - 12.2-2.pgdg100+1
postgresql-client-common - 213.pgdg100+1
postgresql-common - 213.pgdg100+1

The steps to create the slot and consume it with wal2json are the following:

pg_recvlogical -d test_db -U test_u -p 5433 -W --slot test_slot --create-slot -P wal2json
pg_recvlogical -d test_db -U test_u -p 5433 -W --slot test_slot --start -f /dev/null
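
As a sanity check, the slot can also be inspected from SQL via the standard
pg_replication_slots view; this is just a sketch, with connection parameters
mirroring the pg_recvlogical calls above:

# the slot should show as active while pg_recvlogical is attached
psql -d test_db -U test_u -p 5433 -c "SELECT slot_name, plugin, slot_type, active, restart_lsn, confirmed_flush_lsn FROM pg_replication_slots WHERE slot_name = 'test_slot';"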

The steps to create the slot and consume it with the pgoutput plugin are the following:
In PostgreSQL:
CREATE PUBLICATION dbz_minimal_publication FOR TABLE public.xxx;

And then from the shell:
pg_recvlogical -d test_db --slot pgoutput_minimal_test_slot --create-slot -P pgoutput
pg_recvlogical -d test_db -U test_u -p 5433 -W --slot pgoutput_minimal_test_slot --start -h localhost -o proto_version=1 -o publication_names='dbz_minimal_publication' -f /dev/null
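
Likewise, the publication contents and the pgoutput slot can be verified with
something along these lines (a sketch; parameters again mirror the commands above):

# tables covered by the publication
psql -d test_db -U test_u -p 5433 -c "SELECT * FROM pg_publication_tables WHERE pubname = 'dbz_minimal_publication';"
# state of the pgoutput slot
psql -d test_db -U test_u -p 5433 -c "SELECT slot_name, plugin, active, restart_lsn FROM pg_replication_slots WHERE slot_name = 'pgoutput_minimal_test_slot';"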

A few minutes after creating these slots (Thu Nov 5 12:55:45 2020):
Every 5.0s: ps eo user,pid,vsz,rss $(pgrep -f walsender)
USER       PID    VSZ       RSS
postgres   3080   11672352  916664    <- walsender using the wal2json plugin
postgres   5740   11533412  760332    <- walsender using the pgoutput plugin
...
postgres   14333  13011864  2246328   <- other walsenders using the wal2json plugin
postgres   14383  13011864  2244788
postgres   14384  13011864  2246316

Same thing a few hours later (Thu Nov 5 14:01:55 2020):
Every 5.0s: ps eo user,pid,vsz,rss $(pgrep -f walsender)
USER       PID    VSZ       RSS
postgres   3080   12153832  1400920   <- walsender using the wal2json plugin
postgres   5740   12014716  1245524   <- walsender using the pgoutput plugin
...
postgres   14333  13493032  2728520
postgres   14383  13493032  2727012
postgres   14384  13493032  2728672
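
To get a time series instead of a live watch, something like the following
loop can be used (only a rough sketch; the interval and log file name are
arbitrary):

# append a timestamped RSS/VSZ snapshot of all walsenders every minute
while true; do
    date
    ps -o user,pid,vsz,rss -p "$(pgrep -d, -f walsender)"
    sleep 60
done >> /tmp/walsender_memory.log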

Per the wal2json author's suggestion, and based on other posts on this
list, we attached gdb to one of the walsender processes a few weeks ago
and ran the following:
p (long)MemoryContextStats((long)TopMemoryContext)
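
(For anyone wanting to reproduce this non-interactively, the same dump can be
taken in one shot with something like the command below; the PID is one of the
walsender PIDs from the ps output above, and the output is written to the
process's stderr, so it normally ends up in the PostgreSQL log.)

# attach, dump memory context stats, detach
gdb -p 3080 --batch -ex 'p (long)MemoryContextStats((long)TopMemoryContext)'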

It seems we have a lot of ReorderBufferToastHash contexts under the
Logical decoding context:
TopMemoryContext: 435000 total in 9 blocks; 29864 free (17 chunks); 405136 used
  Type information cache: 24352 total in 2 blocks; 2624 free (0 chunks); 21728 used
  pgstat TabStatusArray lookup hash table: 8192 total in 1 blocks; 1456 free (0 chunks); 6736 used
  wal2json output context: 8192 total in 1 blocks; 7936 free (0 chunks); 256 used
  RowDescriptionContext: 8192 total in 1 blocks; 6896 free (0 chunks); 1296 used
  MessageContext: 8192 total in 1 blocks; 6896 free (1 chunks); 1296 used
  Replication command context: 82104 total in 5 blocks; 25848 free (12 chunks); 56256 used
    Logical decoding context: 344176 total in 5 blocks; 26672 free (12 chunks); 317504 used
      snapshot builder context: 524288 total in 7 blocks; 172928 free (24 chunks); 351360 used
      ReorderBuffer: 32768 total in 3 blocks; 28048 free (26 chunks); 4720 used
        ReorderBufferToastHash: 524288 total in 7 blocks; 28784 free (22 chunks); 495504 used
        ReorderBufferToastHash: 524288 total in 7 blocks; 28784 free (22 chunks); 495504 used
        ReorderBufferToastHash: 524288 total in 7 blocks; 28784 free (22 chunks); 495504 used
        ReorderBufferToastHash: 524288 total in 7 blocks; 28784 free (22 chunks); 495504 used
        ...
        ReorderBufferByXid: 16384 total in 2 blocks; 3544 free (2 chunks); 12840 used
        Tuples: 2080374888 total in 248 blocks (1259603 chunks); 241400 free (265927 chunks); 2080133488 used
        TXN: 648 total in 0 blocks; 0 free (0 chunks); 648 used
        Change: 79881960 total in 9751 blocks; 74080 free (926 chunks); 79807880 used
  Operator class cache: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used

The full output can be found on GitHub[2].

Please note that all of this seems to be related to the database
activity: simply running pgbench in a loop, for example, does not
reproduce this "leak".
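
(By "pgbench in a loop" I mean something roughly like the sketch below; the
client count and duration are only illustrative, and the database is assumed
to have been initialised with pgbench -i.)

while true; do
    pgbench -p 5433 -U test_u -c 4 -T 60 test_db
done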

Could someone point me in the right direction to identify the reason for
this behaviour, so that we can try to fix it?

Thanks,
Raph
[1] https://github.com/eulerto/wal2json/issues/180
[2] https://github.com/eulerto/wal2json/files/5292603/memory.txt
