Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown

From: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
To: David Kohn <djk447(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #15036: Un-killable queries Hanging in BgWorkerShutdown
Date: 2018-01-30 04:07:27
Message-ID: CAEepm=0TBygUnw0MuR6HCZ5mZ483U0ur+GwEKZsKNR4+E1asAQ@mail.gmail.com
Lists: pgsql-bugs

On Tue, Jan 30, 2018 at 4:33 PM, David Kohn <djk447(at)gmail(dot)com> wrote:
> On Mon, Jan 29, 2018 at 9:07 PM Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> wrote:
>> Thanks for the report! Based on the mention of BtreePage, this sounds
>> like the following bug:
>>
>> https://www.postgresql.org/message-id/flat/CAEepm%3D2xZUcOGP9V0O_G0%3D2P2wwXwPrkF%3DupWTCJSisUxMnuSg%40mail.gmail.com
>>
>>
>> The fix for that will be in 10.2 (current target date: February 8th).
>> The workaround in the meantime would be to disable parallelism, at
>> least for the queries doing parallel index scans if you can identify
>> them.
>
> That sounds great. I hope that patch will fix it, though I'm not quite sure
> it will. Some of the hung queries have workers in the BtreePage state, but
> at least as many have only workers in the MessageQueuePutMessage state.
> Would you expect the patch to fix those as well? Or could it be something
> different?

Maybe like this:

1. Leader process encounters the bug and starts waiting for itself
forever (caused by encountering concurrently deleted btree pages on a
busy system, see that other thread for gory details). This looks like
wait event = BtreePage.
2. Worker backend has emitted a bunch of tuples and filled up its
output tuple queue, but the leader isn't reading from the queue, so
the worker waits forever. This looks like wait event =
MessageQueuePutMessage.

The second thing is just expected and correct behaviour in workers if
the leader process is jammed.
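
If it helps to confirm that pattern, a query along these lines against
pg_stat_activity should show the stuck leader and its workers together
(I'm assuming the 10.x view columns here; backend_type is what separates
the leader from its parallel workers):

  -- rough sketch: list backends stuck on the wait events described above
  SELECT pid, backend_type, wait_event_type, wait_event, state, query_start
  FROM pg_stat_activity
  WHERE wait_event IN ('BtreePage', 'MessageQueuePutMessage')
  ORDER BY query_start;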

>> However, I'm not entirely sure why you're not able to cancel these
>> backends politely with pg_cancel_backend(). For example, the
>> BtreePage waiter should be in ConditionVariableSleep() and should be
>> interrupted by such a signal and error out in CHECK_FOR_INTERRUPTS().
>
> So far I haven't found anything other than a kill -9 that can kill them. I
> have a feeling it has something to do with:
> https://jobs.zalando.com/tech/blog/hack-to-terminate-tcp-conn-postgres/?gh_src=4n3gxh1
> but I'm not 100% sure, as I didn't set the TCP settings low enough to make
> catching a packet all that practical. I'm happy to investigate further; I
> just don't quite know what that should entail. If you have things you think
> would be helpful, please do let me know.
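
Just to make sure we mean the same thing by "politely": I'm thinking of
the SQL-level functions, roughly like this (12345 stands in for one of
the stuck PIDs):

  SELECT pg_cancel_backend(12345);     -- sends SIGINT to cancel the query
  SELECT pg_terminate_backend(12345);  -- sends SIGTERM to end the backend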

Hmm. Well, usually in a case like this the most useful thing would be
a backtrace ("gdb /path/to/binary -p PID", then "bt") to show exactly
where they're stuck. But in this case we already know more-or-less
where they're waiting (the wait event names tell us), and the real
question is: why on earth aren't the wait loops responding to SIGINT
and SIGTERM? I wonder if there might be something funky about
parallel query + statement timeouts.
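
In the meantime, if you need to apply the workaround I mentioned more
broadly, disabling parallel query per session or per role is probably
the least disruptive option; a rough sketch (the GUC name is the 10.x
one, and app_user is just a placeholder role name):

  -- disable parallel query for the current session only
  SET max_parallel_workers_per_gather = 0;

  -- or make it the default for a particular role
  ALTER ROLE app_user SET max_parallel_workers_per_gather = 0;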

--
Thomas Munro
http://www.enterprisedb.com
