From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org, Merlin Moncure <mmoncure(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: database stuck in __epoll_wait_nocancel(). Are infinite timeouts safe?
Date: 2020-03-13 19:28:37
Message-ID: 6E5A7EFA-A1BF-4C7A-99F6-11ADA24FAD80@anarazel.de
Lists: pgsql-hackers
Hi,
On March 13, 2020 12:08:32 PM PDT, Merlin Moncure <mmoncure(at)gmail(dot)com> wrote:
>I have 5 servers in a testing environment that comprise a data
>warehousing cluster. They will typically each get exactly the same
>query at approximately the same time. Yesterday, around 1pm, 3 of
>the 5 got stuck on the same query. Each of them yields a similar
>stack trace. This happens now and then. The server is 9.6.12
>(which is obviously old, but I did not see any changes in the
>relevant code).
>
>(gdb) bt
>#0 0x00007fe856c0b463 in __epoll_wait_nocancel () from
>/lib64/libc.so.6
>#1 0x00000000006b4416 in WaitEventSetWaitBlock (nevents=1,
>occurred_events=0x7ffc9f2b0f60, cur_timeout=-1, set=0x27cace8) at
>latch.c:1053
>#2 WaitEventSetWait (set=0x27cace8, timeout=timeout(at)entry=-1,
>occurred_events=occurred_events(at)entry=0x7ffc9f2b0f60,
>nevents=nevents(at)entry=1) at latch.c:1007
>#3 0x00000000005f26dd in secure_write (port=0x27f16a0,
>ptr=ptr(at)entry=0x27f5528, len=len(at)entry=192) at be-secure.c:255
>#4 0x00000000005fb51b in internal_flush () at pqcomm.c:1410
>#5 0x00000000005fb72a in internal_putbytes (s=0x2a4f245 "14M04",
>s(at)entry=0x2a4f228 "", len=70) at pqcomm.c:1356
>#6 0x00000000005fb7f0 in socket_putmessage (msgtype=68 'D',
>s=0x2a4f228 "", len=<optimized out>) at pqcomm.c:1553
>#7 0x00000000005fd5d9 in pq_endmessage (buf=buf(at)entry=0x7ffc9f2b1040)
>at pqformat.c:347
>#8 0x0000000000479a63 in printtup (slot=0x2958fc8, self=0x2b6bca0) at
>printtup.c:372
>#9 0x00000000005c1cc9 in ExecutePlan (dest=0x2b6bca0,
>direction=<optimized out>, numberTuples=0, sendTuples=1 '\001',
>operation=CMD_SELECT,
> use_parallel_mode=<optimized out>, planstate=0x2958cf8,
>estate=0x2958be8) at execMain.c:1606
>#10 standard_ExecutorRun (queryDesc=0x2834998, direction=<optimized
>out>, count=0) at execMain.c:339
>#11 0x00000000006d69a7 in PortalRunSelect
>(portal=portal(at)entry=0x2894e38, forward=forward(at)entry=1 '\001',
>count=0, count(at)entry=9223372036854775807,
> dest=dest(at)entry=0x2b6bca0) at pquery.c:948
>#12 0x00000000006d7dbb in PortalRun (portal=0x2894e38,
>count=9223372036854775807, isTopLevel=<optimized out>, dest=0x2b6bca0,
>altdest=0x2b6bca0,
> completionTag=0x7ffc9f2b14e0 "") at pquery.c:789
>#13 0x00000000006d5a06 in PostgresMain (argc=<optimized out>,
>argv=<optimized out>, dbname=<optimized out>, username=<optimized
>out>) at postgres.c:1109
>#14 0x000000000046fc28 in BackendRun (port=0x27f16a0) at
>postmaster.c:4342
>#15 BackendStartup (port=0x27f16a0) at postmaster.c:4016
>#16 ServerLoop () at postmaster.c:1721
>#17 0x0000000000678119 in PostmasterMain (argc=argc(at)entry=3,
>argv=argv(at)entry=0x27c8c90) at postmaster.c:1329
>#18 0x000000000047088e in main (argc=3, argv=0x27c8c90) at main.c:228
>(gdb) quit
>
>Now, the fact that this happened to multiple servers at the same time
>strongly suggests a problem external to the database. The system that
>initiated the query, a cross-database query over dblink, gave up (and
>was restarted as a precaution) a long time ago, and the connection is
>dead. However, secure_write() sets an infinite timeout on the latch
>wait, and there are clearly scenarios where epoll waits forever for
>an event that is never going to occur. If/when this happens, the only
>recourse is to restart the affected database. The question is,
>shouldn't the latch wait use a looping timeout that checks for
>interrupts? What would the risks be of jumping directly out of the
>latch loop?
Unless there is a kernel problem, latches are interruptible by signals, as the signal handler should do a SetLatch().
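
For illustration, here is a minimal standalone sketch (not the backend's actual latch code) of the self-pipe technique this relies on: the signal handler just writes a byte to a pipe that is registered in the epoll set, so a wait with a -1 (infinite) timeout still ends, even if the signal arrives just before epoll_wait() is entered.

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

static int selfpipe[2];

static void
wakeup_handler(int signo)
{
    /* Async-signal-safe, analogous to SetLatch(): just poke the pipe. */
    (void) write(selfpipe[1], "x", 1);
}

int
main(void)
{
    struct epoll_event ev,
                out;
    int         epfd,
                rc;

    if (pipe(selfpipe) < 0 || (epfd = epoll_create1(0)) < 0)
        exit(1);
    signal(SIGUSR1, wakeup_handler);

    ev.events = EPOLLIN;
    ev.data.fd = selfpipe[0];
    epoll_ctl(epfd, EPOLL_CTL_ADD, selfpipe[0], &ev);

    /*
     * Infinite timeout.  A SIGUSR1 either interrupts the wait (EINTR,
     * retried below, as latch.c also does) or has already made the pipe
     * readable, so the wait returns in both cases.
     */
    do
    {
        rc = epoll_wait(epfd, &out, 1, -1);
    } while (rc < 0 && errno == EINTR);

    if (rc > 0 && out.data.fd == selfpipe[0])
        printf("woken up via the self-pipe\n");
    return 0;
}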
This backtrace just looks like the backend is trying to send data to the client? What makes you think it's stuck?
If the connection is dead, epoll should return (both because we ask for the relevant events, and because it just always implicitly does so).
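
Concretely, epoll_wait() reports EPOLLERR and EPOLLHUP whether or not they were requested when the descriptor was registered. A hypothetical helper (sketch only, not backend code; wait_for_writeable_or_dead() and sockfd are made-up names) that asks only for writability, as a blocked send does, still sees the error/hangup bits once the peer resets the connection:

#include <stdbool.h>
#include <sys/epoll.h>
#include <unistd.h>

/*
 * Wait for sockfd (assumed to be a connected socket) to become
 * writeable; report whether the kernel flagged error/hangup, which
 * epoll delivers even though only EPOLLOUT was requested.
 */
static bool
wait_for_writeable_or_dead(int sockfd)
{
    struct epoll_event ev,
                out;
    int         epfd = epoll_create1(0);
    bool        dead = false;

    if (epfd < 0)
        return false;

    ev.events = EPOLLOUT;       /* only ask for writability ... */
    ev.data.fd = sockfd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &ev);

    if (epoll_wait(epfd, &out, 1, -1) > 0)
    {
        /* ... but error/hangup conditions are reported regardless */
        dead = (out.events & (EPOLLERR | EPOLLHUP)) != 0;
    }

    close(epfd);
    return dead;
}

Note that this only wakes up once the kernel has actually noticed the connection is gone (an RST arrived, or retransmissions timed out), which is exactly the distinction in the next paragraph.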
So it seems likely that either your connection isn't actually dead (e.g. still waiting for TCP timeouts), or you have a kernel bug.
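
If it is the TCP-timeout case, the socket options below bound how long that can take. This is only a sketch under assumed Linux defaults; bound_tcp_timeouts() is a made-up name, 9.6 exposes the keepalive settings as tcp_keepalives_idle/interval/count, and a tcp_user_timeout setting only appeared in later major versions.

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Bound how long a silently-vanished peer can keep a send blocked:
 * keepalives cover otherwise idle connections, TCP_USER_TIMEOUT
 * (Linux >= 2.6.37) caps how long already-sent, unacknowledged data
 * may linger before the connection is declared dead.
 */
static void
bound_tcp_timeouts(int sockfd)
{
    int         on = 1;
    int         idle = 60;      /* seconds of idle before first probe */
    int         interval = 10;  /* seconds between probes */
    int         count = 5;      /* unanswered probes before giving up */
    unsigned int user_timeout = 5 * 60 * 1000;  /* milliseconds */

    setsockopt(sockfd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
    setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof(idle));
    setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPINTVL, &interval, sizeof(interval));
    setsockopt(sockfd, IPPROTO_TCP, TCP_KEEPCNT, &count, sizeof(count));
    setsockopt(sockfd, IPPROTO_TCP, TCP_USER_TIMEOUT,
               &user_timeout, sizeof(user_timeout));
}

Keepalives only fire on an idle connection; with unacknowledged data queued, as in a backend blocked in secure_write(), it is the retransmission timeout or TCP_USER_TIMEOUT that eventually surfaces the error to epoll.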
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.