PG16 walsender hangs in ResourceArrayEnlarge using pgoutput

From: Bowen Shi <zxwsbg12138(at)gmail(dot)com>
To: PostgreSQL mailing lists <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: PG16 walsender hangs in ResourceArrayEnlarge using pgoutput
Date: 2024-06-25 07:52:52
Message-ID: CAM_vCudv8dc3sjWiPkXx5F2b27UV7_YRKRbtSCcE-pv=cVACGA@mail.gmail.com
Lists: pgsql-bugs

Hi,

Recently I found a PG16 walsender hanging in a huge loop while using a
subscription. The problem can be reproduced with the following steps (a rough
SQL sketch follows the list):

1. create a partitioned table;
2. create a publication and a subscription;
3. alter the partitioned table with ADD COLUMN col varchar(10) NOT NULL
DEFAULT ''::character varying;
4. insert/update 10000000 rows into the partitioned table.
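
Roughly, the setup looks like the following. The table and column names and
the connection string are placeholders (pub_t and sub_t are the names that
appear in the stack below); the exact DDL I used may differ, and whether
publish_via_partition_root is required is my guess based on the
publish_as_relid condition discussed further down, so treat this as a sketch
rather than a verified self-contained reproducer:

-- publisher
CREATE TABLE t (id bigint, val text) PARTITION BY RANGE (id);
CREATE TABLE t_p1 PARTITION OF t FOR VALUES FROM (MINVALUE) TO (MAXVALUE);
CREATE PUBLICATION pub_t FOR TABLE t
    WITH (publish_via_partition_root = true);  -- guess: makes publish_as_relid the root

-- subscriber (a separate cluster with the same table definitions;
-- connection string is a placeholder)
-- CREATE SUBSCRIPTION sub_t CONNECTION 'host=... dbname=...' PUBLICATION pub_t;

-- back on the publisher
ALTER TABLE t ADD COLUMN col varchar(10) NOT NULL DEFAULT ''::character varying;
INSERT INTO t (id, val) SELECT g, g::text FROM generate_series(1, 10000000) AS g;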

The walsender process can then be found hanging in the loop, with the
following stack:

#0 ResourceArrayAdd (resarr=0x2dfc048, value=140175392354560)
#1 ResourceArrayEnlarge (resarr=0x2dfc048)
#2 ResourceOwnerEnlargeTupleDescs (owner=0x2dfbf68)
#3 IncrTupleDescRefCount (tupdesc=0x7f7d2077b100)
#4 MakeTupleTableSlot (tupleDesc=0x7f7d2077b100, tts_ops=0xd237a0
<TTSOpsVirtual>)
#5 pgoutput_change (ctx=0x2eb1e68, txn=0x2edfec0, relation=0x7f7d2077e120,
change=0x2f35138)
#6 change_cb_wrapper (cache=0x2eb3e78, txn=0x2edfec0,
relation=0x7f7d2077e120, change=0x2f35138)
#7 ReorderBufferApplyChange (rb=0x2eb3e78, txn=0x2edfec0,
relation=0x7f7d2077e120, change=0x2f35138, streaming=false)
#8 ReorderBufferProcessTXN (rb=0x2eb3e78, txn=0x2edfec0,
commit_lsn=4675920936, snapshot_now=0x2ec82d0, command_id=0,
streaming=false)
#9 ReorderBufferReplay (txn=0x2edfec0, rb=0x2eb3e78, xid=758,
commit_lsn=4675920936, end_lsn=4675920984, commit_time=772615467516504,
origin_id=0, origin_lsn=0)
#10 ReorderBufferCommit (rb=0x2eb3e78, xid=758, commit_lsn=4675920936,
end_lsn=4675920984, commit_time=772615467516504, origin_id=0, origin_lsn=0)
#11 DecodeCommit (ctx=0x2eb1e68, buf=0x7ffe83c408f0, parsed=0x7ffe83c406f0,
xid=758, two_phase=false)
#12 xact_decode (ctx=0x2eb1e68, buf=0x7ffe83c408f0)
#13 LogicalDecodingProcessRecord (ctx=0x2eb1e68, record=0x2eb2200)
#14 XLogSendLogical ()
#15 WalSndLoop (send_data=0x95b183 <XLogSendLogical>)
#16 StartLogicalReplication (cmd=0x2e8fcd0)
#17 exec_replication_command (cmd_string=0x2dc3dc8 "START_REPLICATION SLOT
\"sub_t\" LOGICAL 0/0 (proto_version '4', origin 'any', publication_names
'\"pub_t\"')")
#18 PostgresMain (dbname=0x2dfa7c8 "tdts", username=0x2dfa7a8 "postgres")
#19 BackendRun (port=0x2dedd60)
#20 BackendStartup (port=0x2dedd60)
#21 ServerLoop ()
#22 PostmasterMain (argc=3, argv=0x2dbe920)

In pgoutput.c, I found the following reasons:
1. In init_tuple_slot(), an attrmap is built when we publish a partition via
its root (entry->publish_as_relid != RelationGetRelid(relation)) and the
missing column has been added to the partitioned table with an ALTER TABLE
statement;
2. In pgoutput_change(), the tuple is converted if needed, calling
MakeTupleTableSlot() to create a new slot;
3. When we update 30000000 rows, the path
MakeTupleTableSlot -> PinTupleDesc -> IncrTupleDescRefCount ->
ResourceOwnerEnlargeTupleDescs -> ResourceArrayEnlarge calls ResourceArrayAdd
far too many times, which causes the problem.

Regards,
Bowen Shi
