RE: DROP DATABASE deadlocks with logical replication worker in PG 15.1

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: "Andrey M(dot) Borodin" <x4mmm(at)yandex-team(dot)ru>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Lakshmi Narayanan Sreethar <lakshmi(at)timescale(dot)com>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: RE: DROP DATABASE deadlocks with logical replication worker in PG 15.1
Date: 2023-08-02 06:12:25
Message-ID: OS0PR01MB571615195FAF7BB85E935C74940BA@OS0PR01MB5716.jpnprd01.prod.outlook.com
Lists: pgsql-bugs

On Tuesday, August 1, 2023 9:15 PM Andrey M. Borodin <x4mmm(at)yandex-team(dot)ru> wrote:

Hi,

> > On 24 Jan 2023, at 10:35, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Sat, Jan 21, 2023 at 5:06 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> >>
> >>
> >> I have verified it and it looks good to me. I'll push and backpatch
> >> this till PG14 on Tuesday unless there are more comments/suggestions.
> >>
> >
> > Pushed.
>
> Hi!
>
> I've observed a similar problem on one of our production clusters. Unfortunately,
> I barely managed to capture some debug information; the cluster was deleted
> right after I started investigating the problem.
>
> PG 15.3.
>
> "Drop database" was hanging in
> (gdb) bt
> #0 0x00007f215baa7907 in epoll_wait (epfd=13, events=0x563a71d25278,
> maxevents=1, timeout=timeout(at)entry=5000)
> at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
> #1 0x0000563a713888db in WaitEventSetWaitBlock (nevents=1,
> occurred_events=0x7fffe77d40b0, cur_timeout=5000, set=0x563a71d25218)
> at ./build/../src/backend/storage/ipc/latch.c:1489
> #2 WaitEventSetWait (set=0x563a71d25218, timeout=timeout(at)entry=5000,
> occurred_events=occurred_events(at)entry=0x7fffe77d4100,
> nevents=nevents(at)entry=1,
> wait_event_info=wait_event_info(at)entry=134217765)
> at ./build/../src/backend/storage/ipc/latch.c:1435
> #3 0x0000563a71388c88 in WaitLatch (latch=<optimized out>,
> wakeEvents=wakeEvents(at)entry=41, timeout=timeout(at)entry=5000,
> wait_event_info=wait_event_info(at)entry=134217765)
> at ./build/../src/backend/storage/ipc/latch.c:497
> #4 0x0000563a713951f6 in ConditionVariableTimedSleep
> (cv=cv(at)entry=0x7f2150627f1c, timeout=timeout(at)entry=5000,
> wait_event_info=wait_event_info(at)entry=134217765)
> at ./build/../src/backend/storage/lmgr/condition_variable.c:163
> #5 0x0000563a7138e93b in WaitForProcSignalBarrier (generation=2)
> at ./build/../src/backend/storage/ipc/procsignal.c:415
> #6 0x0000563a711b2339 in dropdb (dbname=<optimized out>,
> missing_ok=<optimized out>, force=<optimized out>)
> at ./build/../src/backend/commands/dbcommands.c:1690
> #7 0x0000563a713b3ef3 in standard_ProcessUtility
> (pstmt=pstmt(at)entry=0x563a71d2ef18,
> queryString=queryString(at)entry=0x563a71d2e218 "DROP DATABASE \"db1\"",
> readOnlyTree=<optimized out>,
> context=context(at)entry=PROCESS_UTILITY_TOPLEVEL,
> params=params(at)entry=0x0, queryEnv=queryEnv(at)entry=0x0,
> dest=0x563a71d2eff8, qc=0x7fffe77d4860)
> at ./build/../src/backend/tcop/utility.c:797
> #8 0x00007f2154894443 in pgss_ProcessUtility (pstmt=0x563a71d2ef18,
> queryString=0x563a71d2e218 "DROP DATABASE \"db1\"",
> readOnlyTree=<optimized out>, context=PROCESS_UTILITY_TOPLEVEL,
> params=0x0,
> queryEnv=0x0, dest=0x563a71d2eff8, qc=0x7fffe77d4860)
> at ./build/../contrib/pg_stat_statements/pg_stat_statements.c:1206
> #9 0x0000563a713b2015 in PortalRunUtility
> (portal=portal(at)entry=0x563a71e1af58, pstmt=pstmt(at)entry=0x563a71d2ef18,
> isTopLevel=isTopLevel(at)entry=true,
> setHoldSnapshot=setHoldSnapshot(at)entry=false,
> dest=0x563a71d2eff8, qc=0x7fffe77d4860)
> at ./build/../src/backend/tcop/pquery.c:1158
> #10 0x0000563a713b2140 in PortalRunMulti
> (portal=portal(at)entry=0x563a71e1af58, isTopLevel=isTopLevel(at)entry=true,
> setHoldSnapshot=setHoldSnapshot(at)entry=false,
> dest=dest(at)entry=0x563a71d2eff8,
> altdest=altdest(at)entry=0x563a71d2eff8, qc=qc(at)entry=0x7fffe77d4860)
> at ./build/../src/backend/tcop/pquery.c:1322
> #11 0x0000563a713b25cb in PortalRun (portal=0x563a71e1af58,
> count=9223372036854775807, isTopLevel=<optimized out>,
> run_once=<optimized out>, dest=0x563a71d2eff8, altdest=0x563a71d2eff8,
> qc=0x7fffe77d4860)
> at ./build/../src/backend/tcop/pquery.c:791
> #12 0x0000563a71054f5e in exec_simple_query
> (query_string=0x563a71d2e218 "DROP DATABASE \"db1\"")
> at ./build/../src/backend/tcop/postgres.c:1250
> #13 0x0000563a713b0c64 in PostgresMain (dbname=<optimized out>,
> username=<optimized out>) at ./build/../src/backend/tcop/postgres.c:4593
> #14 0x0000563a7132385a in BackendRun (port=<optimized out>,
> port=<optimized out>)
> at ./build/../src/backend/postmaster/postmaster.c:4518
> #15 BackendStartup (port=<optimized out>)
> at ./build/../src/backend/postmaster/postmaster.c:4246
> #16 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1806
> #17 0x0000563a7132478c in PostmasterMain (argc=9, argv=<optimized out>)
> at ./build/../src/backend/postmaster/postmaster.c:1478
> #18 0x0000563a7105607b in main (argc=9, argv=0x563a71d240e0)
> at ./build/../src/backend/main/main.c:202
> (gdb) select-frame 5
> (gdb) p ProcSignal->psh_slot[i]
> $1 = {pss_pid = 9131, pss_signalFlags = {0 <repeats 12 times>},
> pss_barrierGeneration = {value = 0}, pss_barrierCheckMask = {value = 1},
> pss_barrierCV = {mutex = 0 '\000', wakeup = {head = 1565, tail = 1565}}}
> (gdb) p generation
> $2 = 2
>
>
> There were 3 inactive logical replication slots; the corresponding processes had
> interrupts disabled and a bt like this:
> #0 0x00007f215baa7907 in epoll_wait (epfd=6, events=0x563a71d26150,
> maxevents=1, timeout=timeout(at)entry=-1)
> at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
> #1 0x0000563a713888db in WaitEventSetWaitBlock (nevents=1,
> occurred_events=0x1, cur_timeout=-1, set=0x563a71d260d8)
> at ./build/../src/backend/storage/ipc/latch.c:1489
> #2 WaitEventSetWait (set=0x563a71d260d8, timeout=timeout(at)entry=-1,
> occurred_events=occurred_events(at)entry=0x7fffe77d4850,
> nevents=nevents(at)entry=1,
> wait_event_info=wait_event_info(at)entry=100663296)
> at ./build/../src/backend/storage/ipc/latch.c:1435
> #3 0x0000563a71287747 in secure_read (port=0x563a71d58780,
> ptr=0x563a719c7140 <PqRecvBuffer>, len=8192)
> at ./build/../src/backend/libpq/be-secure.c:186
> #4 0x0000563a7128e463 in pq_recvbuf ()
> at ./build/../src/backend/libpq/pqcomm.c:957
> #5 0x0000563a7128f495 in pq_getbyte ()
> at ./build/../src/backend/libpq/pqcomm.c:1000
> #6 0x0000563a713aed02 in SocketBackend (inBuf=0x7fffe77d4a30)
> at ./build/../src/backend/tcop/postgres.c:351
> #7 ReadCommand (inBuf=0x7fffe77d4a30)
> at ./build/../src/backend/tcop/postgres.c:474
> #8 PostgresMain (dbname=<optimized out>, username=<optimized out>)
> at ./build/../src/backend/tcop/postgres.c:4525
> #9 0x0000563a7132385a in BackendRun (port=<optimized out>,
> port=<optimized out>)
> at ./build/../src/backend/postmaster/postmaster.c:4518
> #10 BackendStartup (port=<optimized out>)
> at ./build/../src/backend/postmaster/postmaster.c:4246
> #11 ServerLoop () at ./build/../src/backend/postmaster/postmaster.c:1806
> #12 0x0000563a7132478c in PostmasterMain (argc=9, argv=<optimized out>)
> at ./build/../src/backend/postmaster/postmaster.c:1478
> #13 0x0000563a7105607b in main (argc=9, argv=0x563a71d240e0)
> at ./build/../src/backend/main/main.c:202
>
> In pg_stat_activity the query was like "create table...", but the backend type was
> walsender and it was idle.
>
> Unfortunately, that's all the information I could collect. I'd appreciate any hints
> on what information to collect if I see this happen again.

I think it would be better to capture the full contents of pg_stat_activity when the
DROP DATABASE hangs. The server log is also needed, as it may include messages like
"still waiting for backend with PID xxx to ...".
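
As a minimal sketch (run from another session while the DROP DATABASE is stuck; the
exact column list is only a suggestion), something like:

  -- snapshot of every backend, including background workers and walsenders
  SELECT pid, backend_type, datname, state,
         wait_event_type, wait_event,
         left(query, 80) AS query
  FROM pg_stat_activity
  ORDER BY backend_type, pid;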

Besides, you mentioned there were inactive slots, but just to confirm: is logical
replication actually running (active) in your environment? If so, it would also be
helpful to collect the backtraces of the logical replication workers, if any exist,
to check whether the hang is related to logical replication.
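
To find the relevant PIDs, the queries below may help (again only a sketch;
pg_stat_subscription.pid is NULL when a worker is not running):

  -- subscriber side: apply / tablesync worker PIDs
  SELECT subname, pid FROM pg_stat_subscription;

  -- replication-related backends: walsenders on the publisher,
  -- launcher and apply workers on the subscriber
  SELECT pid, backend_type, state, wait_event, left(query, 80) AS query
  FROM pg_stat_activity
  WHERE backend_type IN ('walsender', 'logical replication launcher',
                         'logical replication worker');

The stacks can then be taken with "gdb -p <pid>" and "bt" for each reported PID.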

Best Regards,
Hou zj
