From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Parallel worker hangs while handling errors. |
Date: | 2020-08-07 15:36:45 |
Message-ID: | 2754818.1596814605@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Robert Haas <robertmhaas(at)gmail(dot)com> writes:
> On Fri, Aug 7, 2020 at 5:05 AM Bharath Rupireddy
> <bharath(dot)rupireddyforpostgres(at)gmail(dot)com> wrote:
>> We intend to unblock SIGQUIT before sigsetjmp() in places like
>> bgwriter, checkpointer, walwriter and walreceiver, but we only call
>> sigdelset(&BlockSig, SIGQUIT);, Without PG_SETMASK(&BlockSig);, seems
>> like we are not actually unblocking SIQUIT and quickdie() will never
>> get called in these processes if (sigsetjmp(local_sigjmp_buf, 1) !=
>> 0){....}
> Yeah, maybe so. This code has been around for a long time and I'm not
> sure what the thought process behind it was, but I don't see a flaw in
> your analysis here.
I think that code is the way it is intentionally: the idea is to not
accept any signals until we reach the explicit "PG_SETMASK(&UnBlockSig);"
call further down, between the sigsetjmp stanza and the main loop.
The sigdelset call, just like the adjacent pqsignal calls, is
preparatory setup; it does not intend to allow anything to happen
immediately.
In general, you don't want to accept signals in that area because the
process state may not be fully set up yet. You could argue that the
SIGQUIT handler has no state dependencies, making it safe to accept
SIGQUIT earlier during startup of one of these processes, and likewise
for them to accept SIGQUIT during error recovery. But barring actual
evidence of a problem with slow SIGQUIT response in these areas I'm more
inclined to leave well enough alone. Changing this would add hazards,
e.g. if somebody ever changes the behavior of the SIGQUIT handler, so
I'd want some concrete evidence of a benefit. It seems fairly
irrelevant to the problem at hand with bgworkers, anyway.
As for said problem, I concur with Robert that the v4 patch seems pretty
dubious; it's adding a lot of barely-thought-out mechanism for no
convincing gain in safety. I think the v1 patch was more nearly the right
thing, except that the unblock needs to happen a tad further down, as
attached (this is untested but certainly it should pass any test that v1
passed). I didn't do anything about rewriting the bogus comment just
above the sigsetjmp call, but I agree that that should happen too.
regards, tom lane
Attachment | Content-Type | Size |
---|---|---|
fix-parallel-worker-hang-v5.patch | text/x-diff | 724 bytes |
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Haas | 2020-08-07 15:50:55 | Re: recovering from "found xmin ... from before relfrozenxid ..." |
Previous Message | Robert Haas | 2020-08-07 14:44:35 | Re: Handing off SLRU fsyncs to the checkpointer |