Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1

From: "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>, Lakshmi Narayanan Sreethar <lakshmi(at)timescale(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1
Date: 2023-08-01 13:15:00
Message-ID: EF98BB5B-CA83-443E-B8A6-AA58EE4A06BB@yandex-team.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

> On 24 Jan 2023, at 10:35, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Sat, Jan 21, 2023 at 5:06 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>>
>>
>> I have verified it and it looks good to me. I'll push and backpatch
>> this till PG14 on Tuesday unless there are more comments/suggestions.
>>
>
> Pushed.

Hi!

I've observed similar problem on one of our production clusters. Unfortunately, I barely managed to capture some debug information. Cluster was deleted right after I started investigating the problem.

PG 15.3.

"Drop database" was hanging in
(gdb) bt
#0 0x00007f215baa7907 in epoll_wait (epfd=13, events=0x563a71d25278, maxevents=1, timeout=timeout(at)entry=5000) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x0000563a713888db in WaitEventSetWaitBlock (nevents=1, occurred_events=0x7fffe77d40b0, cur_timeout=5000, set=0x563a71d25218) at ./build/../src/backend/storage/ipc/latch.c:1489
#2 WaitEventSetWait (set=0x563a71d25218, timeout=timeout(at)entry=5000, occurred_events=occurred_events(at)entry=0x7fffe77d4100, nevents=nevents(at)entry=1, wait_event_info=wait_event_info(at)entry=134217765)
at ./build/../src/backend/storage/ipc/latch.c:1435
#3 0x0000563a71388c88 in WaitLatch (latch=<optimized out>, wakeEvents=wakeEvents(at)entry=41, timeout=timeout(at)entry=5000, wait_event_info=wait_event_info(at)entry=134217765)
at ./build/../src/backend/storage/ipc/latch.c:497
#4 0x0000563a713951f6 in ConditionVariableTimedSleep (cv=cv(at)entry=0x7f2150627f1c, timeout=timeout(at)entry=5000, wait_event_info=wait_event_info(at)entry=134217765)
at ./build/../src/backend/storage/lmgr/condition_variable.c:163
#5 0x0000563a7138e93b in WaitForProcSignalBarrier (generation=2) at ./build/../src/backend/storage/ipc/procsignal.c:415
#6 0x0000563a711b2339 in dropdb (dbname=<optimized out>, missing_ok=<optimized out>, force=<optimized out>) at ./build/../src/backend/commands/dbcommands.c:1690
#7 0x0000563a713b3ef3 in standard_ProcessUtility (pstmt=pstmt(at)entry=0x563a71d2ef18, queryString=queryString(at)entry=0x563a71d2e218 "DROP DATABASE \"db1\"", readOnlyTree=<optimized out>,
context=context(at)entry=PROCESS_UTILITY_TOPLEVEL, params=params(at)entry=0x0, queryEnv=queryEnv(at)entry=0x0, dest=0x563a71d2eff8, qc=0x7fffe77d4860) at ./build/../src/backend/tcop/utility.c:797
#8 0x00007f2154894443 in pgss_ProcessUtility (pstmt=0x563a71d2ef18, queryString=0x563a71d2e218 "DROP DATABASE \"db1\"", readOnlyTree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL, params=0x0,
queryEnv=0x0, dest=0x563a71d2eff8, qc=0x7fffe77d4860) at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:1206
#9 0x0000563a713b2015 in PortalRunUtility (portal=portal(at)entry=0x563a71e1af58, pstmt=pstmt(at)entry=0x563a71d2ef18, isTopLevel=isTopLevel(at)entry=true, setHoldSnapshot=setHoldSnapshot(at)entry=false,
dest=0x563a71d2eff8, qc=0x7fffe77d4860) at ./build/../src/backend/tcop/pquery.c:1158
#10 0x0000563a713b2140 in PortalRunMulti (portal=portal(at)entry=0x563a71e1af58, isTopLevel=isTopLevel(at)entry=true, setHoldSnapshot=setHoldSnapshot(at)entry=false, dest=dest(at)entry=0x563a71d2eff8,
altdest=altdest(at)entry=0x563a71d2eff8, qc=qc(at)entry=0x7fffe77d4860) at ./build/../src/backend/tcop/pquery.c:1322
#11 0x0000563a713b25cb in PortalRun (portal=0x563a71e1af58, count=9223372036854775807, isTopLevel=<optimized out>, run_once=<optimized out>, dest=0x563a71d2eff8, altdest=0x563a71d2eff8, qc=0x7fffe77d4860)
at ./build/../src/backend/tcop/pquery.c:791
#12 0x0000563a71054f5e in exec_simple_query (query_string=0x563a71d2e218 "DROP DATABASE \"db1\"") at ./build/../src/backend/tcop/postgres.c:1250
#13 0x0000563a713b0c64 in PostgresMain (dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4593
#14 0x0000563a7132385a in BackendRun (port=<optimized out>, port=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:4518
#15 BackendStartup (port=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:4246
#16 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1806
#17 0x0000563a7132478c in PostmasterMain (argc=9, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1478
#18 0x0000563a7105607b in main (argc=9, argv=0x563a71d240e0) at ./build/../src/backend/main/main.c:202
(gdb) select-frame 5
(gdb) p ProcSignal->psh_slot[i]
$1 = {pss_pid = 9131, pss_signalFlags = {0 <repeats 12 times>}, pss_barrierGeneration = {value = 0}, pss_barrierCheckMask = {value = 1}, pss_barrierCV = {mutex = 0 '\000', wakeup = {head = 1565, tail = 1565}}}
(gdb) p generation
$2 = 2

There were 3 inactive logical replication slots with disabled interrupts and bt like this
0x00007f215baa7907 in epoll_wait (epfd=6, events=0x563a71d26150, maxevents=1, timeout=timeout(at)entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#0 0x00007f215baa7907 in epoll_wait (epfd=6, events=0x563a71d26150, maxevents=1, timeout=timeout(at)entry=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
#1 0x0000563a713888db in WaitEventSetWaitBlock (nevents=1, occurred_events=0x1, cur_timeout=-1, set=0x563a71d260d8) at ./build/../src/backend/storage/ipc/latch.c:1489
#2 WaitEventSetWait (set=0x563a71d260d8, timeout=timeout(at)entry=-1, occurred_events=occurred_events(at)entry=0x7fffe77d4850, nevents=nevents(at)entry=1, wait_event_info=wait_event_info(at)entry=100663296) at ./build/../src/backend/storage/ipc/latch.c:1435
#3 0x0000563a71287747 in secure_read (port=0x563a71d58780, ptr=0x563a719c7140 <PqRecvBuffer>, len=8192) at ./build/../src/backend/libpq/be-secure.c:186
#4 0x0000563a7128e463 in pq_recvbuf () at ./build/../src/backend/libpq/pqcomm.c:957
#5 0x0000563a7128f495 in pq_getbyte () at ./build/../src/backend/libpq/pqcomm.c:1000
#6 0x0000563a713aed02 in SocketBackend (inBuf=0x7fffe77d4a30) at ./build/../src/backend/tcop/postgres.c:351
#7 ReadCommand (inBuf=0x7fffe77d4a30) at ./build/../src/backend/tcop/postgres.c:474
#8 PostgresMain (dbname=<optimized out>, username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4525
#9 0x0000563a7132385a in BackendRun (port=<optimized out>, port=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:4518
#10 BackendStartup (port=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:4246
#11 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1806
#12 0x0000563a7132478c in PostmasterMain (argc=9, argv=<optimized out>) at ./build/../src/backend/postmaster/postmaster.c:1478
#13 0x0000563a7105607b in main (argc=9, argv=0x563a71d240e0) at ./build/../src/backend/main/main.c:202

in pg_stat_activity query was like "create table...", but backend type was walsender and it was idle.

Unfortunately, that's all information that I could collect. I'd appreciate any hints on what information to collect if I'll see this happen again.

Best regards, Andrey Borodin.

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message David Rowley 2023-08-01 13:46:19 Re: BUG #18035: Assertion failure in jsonb_path_query
Previous Message Kong Man 2023-07-31 23:00:48 Re: pg_restore 14 skips ACL COLUMN when --schema is used