Re: Backend handling replication slot stuck using 100% cpu, unkillable

From: hubert depesz lubaczewski <depesz(at)depesz(dot)com>
To: pgsql-bugs mailing list <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: Backend handling replication slot stuck using 100% cpu, unkillable
Date: 2023-07-03 12:58:07
Message-ID: ZKLF3zT4kV+VUGjJ@depesz.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Mon, Jul 03, 2023 at 01:36:32PM +0200, hubert depesz lubaczewski wrote:
> Hi,
> we are using debezium to get change data from Pg.
>
> This particular pg is 12.9, and will be soon upgrade to 14.something
> (this thursday).

So, i installed dbgsym for this pg, and this bny accident upgraded pg to
12.14.

Now I do have debug symbols, though, so backtrace can be more
informative.

I ran:

for i in 1 2 3; do date; sudo gdb -batch -p 8938 -ex bt; sleep 30; echo; done

And got this:

#v+
Mon Jul 3 12:50:14 UTC 2023
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
hash_seq_search (status=status(at)entry=0xfffff0173f40) at ./build/../src/backend/utils/hash/dynahash.c:1439
1439 ./build/../src/backend/utils/hash/dynahash.c: No such file or directory.
#0 hash_seq_search (status=status(at)entry=0xfffff0173f40) at ./build/../src/backend/utils/hash/dynahash.c:1439
#1 0x0000ffffa1bc8714 in rel_sync_cache_publication_cb (arg=<optimized out>, cacheid=<optimized out>, hashvalue=<optimized out>) at ./build/../src/backend/replication/pgoutput/pgoutput.c:665
#2 0x0000aaaab4bfcef4 in CallSyscacheCallbacks (cacheid=47, hashvalue=1542357812) at ./build/../src/backend/utils/cache/inval.c:1520
#3 0x0000aaaab4a91884 in ReorderBufferExecuteInvalidations (rb=0xffffa4ac5308 <malloc+160>, txn=0xfffff0174240, txn=0xfffff0174240) at ./build/../src/backend/replication/logical/reorderbuffer.c:2187
#4 ReorderBufferCommit (rb=0xffffa4ac5308 <malloc+160>, xid=xid(at)entry=2741814901, commit_lsn=187650155969544, end_lsn=<optimized out>, commit_time=commit_time(at)entry=741514150878208, origin_id=origin_id(at)entry=0, origin_lsn=origin_lsn(at)entry=0) at ./build/../src/backend/replication/logical/reorderbuffer.c:1816
#5 0x0000aaaab4a869bc in DecodeCommit (xid=2741814901, parsed=0xfffff0174390, buf=0xfffff0174510, ctx=0xaaaad5e1df00) at ./build/../src/backend/replication/logical/decode.c:654
#6 DecodeXactOp (ctx=ctx(at)entry=0xaaaad5e1df00, buf=0xfffff0174510, buf(at)entry=0xfffff0174570) at ./build/../src/backend/replication/logical/decode.c:249
#7 0x0000aaaab4a86ad4 in LogicalDecodingProcessRecord (ctx=0xaaaad5e1df00, record=0xaaaad5e1e198) at ./build/../src/backend/replication/logical/decode.c:117
#8 0x0000aaaab4a996ec in XLogSendLogical () at ./build/../src/backend/replication/walsender.c:2883
#9 0x0000aaaab4a9bbb0 in WalSndLoop (send_data=send_data(at)entry=0xaaaab4a99688 <XLogSendLogical>) at ./build/../src/backend/replication/walsender.c:2232
#10 0x0000aaaab4a9c674 in StartLogicalReplication (cmd=0xaaaad5e47f90) at ./build/../src/backend/replication/walsender.c:1134
#11 exec_replication_command (cmd_string=cmd_string(at)entry=0xaaaad5d1db00 "START_REPLICATION SLOT \"slot_name\" LOGICAL 1D6C/92965050 (\"proto_version\" '1', \"publication_names\" 'xxx')") at ./build/../src/backend/replication/walsender.c:1602
#12 0x0000aaaab4af0c08 in PostgresMain (argc=<optimized out>, argv=argv(at)entry=0xaaaad5d7aaf8, dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4289
#13 0x0000aaaab4a759a8 in BackendRun (port=0xaaaad5d76150, port=0xaaaad5d76150) at ./build/../src/backend/postmaster/postmaster.c:4517
#14 BackendStartup (port=0xaaaad5d76150) at ./build/../src/backend/postmaster/postmaster.c:4200
#15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725
#16 0x0000aaaab4a769d4 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1398
#17 0x0000aaaab480355c in main (argc=5, argv=0xaaaad5d16720) at ./build/../src/backend/main/main.c:228

Mon Jul 3 12:50:45 UTC 2023
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000aaaab4c1e11c in hash_search_with_hash_value (hashp=0xaaaad5dea900, keyPtr=0xfffff0173f5c, keyPtr(at)entry=0xfffff0173f7c, hashvalue=1107843932, action=action(at)entry=HASH_FIND, foundPtr=foundPtr(at)entry=0x0) at ./build/../src/backend/utils/hash/dynahash.c:949
949 ./build/../src/backend/utils/hash/dynahash.c: No such file or directory.
#0 0x0000aaaab4c1e11c in hash_search_with_hash_value (hashp=0xaaaad5dea900, keyPtr=0xfffff0173f5c, keyPtr(at)entry=0xfffff0173f7c, hashvalue=1107843932, action=action(at)entry=HASH_FIND, foundPtr=foundPtr(at)entry=0x0) at ./build/../src/backend/utils/hash/dynahash.c:949
#1 0x0000aaaab4c1e79c in hash_search (hashp=<optimized out>, keyPtr=keyPtr(at)entry=0xfffff0173f7c, action=action(at)entry=HASH_FIND, foundPtr=foundPtr(at)entry=0x0) at ./build/../src/backend/utils/hash/dynahash.c:911
#2 0x0000aaaab4c07180 in RelationCacheInvalidateEntry (relationId=<optimized out>) at ./build/../src/backend/utils/cache/relcache.c:2820
#3 0x0000aaaab4bfcfe0 in LocalExecuteInvalidationMessage (msg=0xffffa033ab88) at ./build/../src/backend/utils/cache/inval.c:603
#4 0x0000aaaab4a91884 in ReorderBufferExecuteInvalidations (rb=0xffffa4ac5308 <malloc+160>, txn=0xfffff0174240, txn=0xfffff0174240) at ./build/../src/backend/replication/logical/reorderbuffer.c:2187
#5 ReorderBufferCommit (rb=0xffffa4ac5308 <malloc+160>, xid=xid(at)entry=2741814901, commit_lsn=187650155969544, end_lsn=<optimized out>, commit_time=commit_time(at)entry=741514150878208, origin_id=origin_id(at)entry=0, origin_lsn=origin_lsn(at)entry=0) at ./build/../src/backend/replication/logical/reorderbuffer.c:1816
#6 0x0000aaaab4a869bc in DecodeCommit (xid=2741814901, parsed=0xfffff0174390, buf=0xfffff0174510, ctx=0xaaaad5e1df00) at ./build/../src/backend/replication/logical/decode.c:654
#7 DecodeXactOp (ctx=ctx(at)entry=0xaaaad5e1df00, buf=0xfffff0174510, buf(at)entry=0xfffff0174570) at ./build/../src/backend/replication/logical/decode.c:249
#8 0x0000aaaab4a86ad4 in LogicalDecodingProcessRecord (ctx=0xaaaad5e1df00, record=0xaaaad5e1e198) at ./build/../src/backend/replication/logical/decode.c:117
#9 0x0000aaaab4a996ec in XLogSendLogical () at ./build/../src/backend/replication/walsender.c:2883
#10 0x0000aaaab4a9bbb0 in WalSndLoop (send_data=send_data(at)entry=0xaaaab4a99688 <XLogSendLogical>) at ./build/../src/backend/replication/walsender.c:2232
#11 0x0000aaaab4a9c674 in StartLogicalReplication (cmd=0xaaaad5e47f90) at ./build/../src/backend/replication/walsender.c:1134
#12 exec_replication_command (cmd_string=cmd_string(at)entry=0xaaaad5d1db00 "START_REPLICATION SLOT \"slot_name\" LOGICAL 1D6C/92965050 (\"proto_version\" '1', \"publication_names\" 'xxx')") at ./build/../src/backend/replication/walsender.c:1602
#13 0x0000aaaab4af0c08 in PostgresMain (argc=<optimized out>, argv=argv(at)entry=0xaaaad5d7aaf8, dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4289
#14 0x0000aaaab4a759a8 in BackendRun (port=0xaaaad5d76150, port=0xaaaad5d76150) at ./build/../src/backend/postmaster/postmaster.c:4517
#15 BackendStartup (port=0xaaaad5d76150) at ./build/../src/backend/postmaster/postmaster.c:4200
#16 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725
#17 0x0000aaaab4a769d4 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1398
#18 0x0000aaaab480355c in main (argc=5, argv=0xaaaad5d16720) at ./build/../src/backend/main/main.c:228

Mon Jul 3 12:51:16 UTC 2023
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
hash_seq_search (status=status(at)entry=0xfffff0173f40) at ./build/../src/backend/utils/hash/dynahash.c:1439
1439 ./build/../src/backend/utils/hash/dynahash.c: No such file or directory.
#0 hash_seq_search (status=status(at)entry=0xfffff0173f40) at ./build/../src/backend/utils/hash/dynahash.c:1439
#1 0x0000ffffa1bc8714 in rel_sync_cache_publication_cb (arg=<optimized out>, cacheid=<optimized out>, hashvalue=<optimized out>) at ./build/../src/backend/replication/pgoutput/pgoutput.c:665
#2 0x0000aaaab4bfcef4 in CallSyscacheCallbacks (cacheid=47, hashvalue=3011071378) at ./build/../src/backend/utils/cache/inval.c:1520
#3 0x0000aaaab4a91884 in ReorderBufferExecuteInvalidations (rb=0xffffa4ac5308 <malloc+160>, txn=0xfffff0174240, txn=0xfffff0174240) at ./build/../src/backend/replication/logical/reorderbuffer.c:2187
#4 ReorderBufferCommit (rb=0xffffa4ac5308 <malloc+160>, xid=xid(at)entry=2741814901, commit_lsn=187650155969544, end_lsn=<optimized out>, commit_time=commit_time(at)entry=741514150878208, origin_id=origin_id(at)entry=0, origin_lsn=origin_lsn(at)entry=0) at ./build/../src/backend/replication/logical/reorderbuffer.c:1816
#5 0x0000aaaab4a869bc in DecodeCommit (xid=2741814901, parsed=0xfffff0174390, buf=0xfffff0174510, ctx=0xaaaad5e1df00) at ./build/../src/backend/replication/logical/decode.c:654
#6 DecodeXactOp (ctx=ctx(at)entry=0xaaaad5e1df00, buf=0xfffff0174510, buf(at)entry=0xfffff0174570) at ./build/../src/backend/replication/logical/decode.c:249
#7 0x0000aaaab4a86ad4 in LogicalDecodingProcessRecord (ctx=0xaaaad5e1df00, record=0xaaaad5e1e198) at ./build/../src/backend/replication/logical/decode.c:117
#8 0x0000aaaab4a996ec in XLogSendLogical () at ./build/../src/backend/replication/walsender.c:2883
#9 0x0000aaaab4a9bbb0 in WalSndLoop (send_data=send_data(at)entry=0xaaaab4a99688 <XLogSendLogical>) at ./build/../src/backend/replication/walsender.c:2232
#10 0x0000aaaab4a9c674 in StartLogicalReplication (cmd=0xaaaad5e47f90) at ./build/../src/backend/replication/walsender.c:1134
#11 exec_replication_command (cmd_string=cmd_string(at)entry=0xaaaad5d1db00 "START_REPLICATION SLOT \"slot_name\" LOGICAL 1D6C/92965050 (\"proto_version\" '1', \"publication_names\" 'xxx')") at ./build/../src/backend/replication/walsender.c:1602
#12 0x0000aaaab4af0c08 in PostgresMain (argc=<optimized out>, argv=argv(at)entry=0xaaaad5d7aaf8, dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4289
#13 0x0000aaaab4a759a8 in BackendRun (port=0xaaaad5d76150, port=0xaaaad5d76150) at ./build/../src/backend/postmaster/postmaster.c:4517
#14 BackendStartup (port=0xaaaad5d76150) at ./build/../src/backend/postmaster/postmaster.c:4200
#15 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725
#16 0x0000aaaab4a769d4 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1398
#17 0x0000aaaab480355c in main (argc=5, argv=0xaaaad5d16720) at ./build/../src/backend/main/main.c:228
#v-

Based on suggestion from IRC, i tried "return 0" and contiunue in gdb session.

backtrace changed to:

#v+
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
0x0000aaaab4bfcf84 in LocalExecuteInvalidationMessage (msg=0xffff9ff33e18) at ./build/../src/backend/utils/cache/inval.c:578
578 ./build/../src/backend/utils/cache/inval.c: No such file or directory.
#0 0x0000aaaab4bfcf84 in LocalExecuteInvalidationMessage (msg=0xffff9ff33e18) at ./build/../src/backend/utils/cache/inval.c:578
#1 0x0000aaaab4a91884 in ReorderBufferExecuteInvalidations (rb=0xffffa4ac5308 <malloc+160>, txn=0xfffff0174240, txn=0xfffff0174240) at ./build/../src/backend/replication/logical/reorderbuffer.c:2187
#2 ReorderBufferCommit (rb=0xffffa4ac5308 <malloc+160>, xid=xid(at)entry=2741814901, commit_lsn=187650155969544, end_lsn=<optimized out>, commit_time=commit_time(at)entry=741514150878208, origin_id=origin_id(at)entry=0, origin_lsn=origin_lsn(at)entry=0) at ./build/../src/backend/replication/logical/reorderbuffer.c:1816
#3 0x0000aaaab4a869bc in DecodeCommit (xid=2741814901, parsed=0xfffff0174390, buf=0xfffff0174510, ctx=0xaaaad5e1df00) at ./build/../src/backend/replication/logical/decode.c:654
#4 DecodeXactOp (ctx=ctx(at)entry=0xaaaad5e1df00, buf=0xfffff0174510, buf(at)entry=0xfffff0174570) at ./build/../src/backend/replication/logical/decode.c:249
#5 0x0000aaaab4a86ad4 in LogicalDecodingProcessRecord (ctx=0xaaaad5e1df00, record=0xaaaad5e1e198) at ./build/../src/backend/replication/logical/decode.c:117
#6 0x0000aaaab4a996ec in XLogSendLogical () at ./build/../src/backend/replication/walsender.c:2883
#7 0x0000aaaab4a9bbb0 in WalSndLoop (send_data=send_data(at)entry=0xaaaab4a99688 <XLogSendLogical>) at ./build/../src/backend/replication/walsender.c:2232
#8 0x0000aaaab4a9c674 in StartLogicalReplication (cmd=0xaaaad5e47f90) at ./build/../src/backend/replication/walsender.c:1134
#9 exec_replication_command (cmd_string=cmd_string(at)entry=0xaaaad5d1db00 "START_REPLICATION SLOT \"data_access_platform_cdc\" LOGICAL 1D6C/92965050 (\"proto_version\" '1', \"publication_names\" 'cdc')") at ./build/../src/backend/replication/walsender.c:1602
#10 0x0000aaaab4af0c08 in PostgresMain (argc=<optimized out>, argv=argv(at)entry=0xaaaad5d7aaf8, dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4289
#11 0x0000aaaab4a759a8 in BackendRun (port=0xaaaad5d76150, port=0xaaaad5d76150) at ./build/../src/backend/postmaster/postmaster.c:4517
#12 BackendStartup (port=0xaaaad5d76150) at ./build/../src/backend/postmaster/postmaster.c:4200
#13 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1725
#14 0x0000aaaab4a769d4 in PostmasterMain (argc=<optimized out>, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1398
#15 0x0000aaaab480355c in main (argc=5, argv=0xaaaad5d16720) at ./build/../src/backend/main/main.c:228
#v-

and then it went back to hash_seq_search :(

Anything I can do about it?

Best regards,

depesz

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message David G. Johnston 2023-07-03 13:41:46 Re: group by can use alias from select list but not the having clause
Previous Message hubert depesz lubaczewski 2023-07-03 11:36:32 Backend handling replication slot stuck using 100% cpu, unkillable