Re: Parallel worker hangs while handling errors.

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: vignesh C <vignesh21(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Parallel worker hangs while handling errors.
Date: 2020-07-17 07:51:26
Message-ID: CALj2ACVBLox7LvCKpWd7k-kubwrmeRFjZRTrnGHN6RZQgG4cxw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

>
> Leader process then identifies that there are some messages that need
> to be processed, it copies the messages and sets the latch so that the
> worker process can copy the remaining message from the below function:
> shm_mq_inc_bytes_read -> SetLatch(&sender->procLatch);, Worker is not
> able to receive any signal at this point of time & hangs infinitely
> Worker hangs in this case because when the worker is started the
> signals will be masked using sigprocmask. Unblocking of signals is
> done by calling BackgroundWorkerUnblockSignals in ParallelWorkerMain.
> Now due to error handling the worker has jumped to setjmp in
> StartBackgroundWorker function. Here the signals are in a blocked
> state, hence the signal is not received by the worker process.
>

Your analysis looks fine to me.

Adding some more info:

The worker uses SIGUSR1 (with a special shared memory flag
PROCSIG_PARALLEL_MESSAGE) both for error message sharing(from
mq_putmessage) and for parallel worker shutdown(from
ParallelWorkerShutdown).

And yes, the postmaster blocks SIGUSR1 before forking bgworker(in
PostmasterMain with pqinitmask() and PG_SETMASK(&BlockSig)), bgworker
receives the same blocked signal mask for itself and enters
StartBackgroundWorker(), uses sigsetjmp for error handling, and then
goes to ParallelWorkerMain() where few of the signal handers are set
and then BackgroundWorkerUnblockSignals() is called to not block any
signals.

But observe when we did sigsetjmp the state of the signal mask is that
of we received from postmaster which is all signals blocked.

So, now in error cases when the control is jumped to sigsetjmp we
still have the signals blocked(along with SIGUSR1) mask and in the
code path of EmitErrorReport, we do send SIGUSR1 with flag
PROCSIG_PARALLEL_MESSAGE to the leader/backend and wait for the latch
to be set, this happens only if the worker is able to receive back
SIGUSR1 from the leader/backend.

In this reported issue, since SIGUSR1 is blocked at sigsetjmp in
StartBackgroundWorker(), there is no way that the worker process
receiving it from the leader and the latch cannot be set and hence the
hang occurs.

The same hang issue can occur(though I'm not able to back it up with a
use case), in the cases from wherever the EmitErrorReport() is called
from "if (sigsetjmp(local_sigjmp_buf, 1) != 0)" block, such as
autovacuum.c, bgwriter.c, bgworker.c, checkpointer.c, walwriter.c, and
postgres.c.

>
> One of the fixes could be to call BackgroundWorkerUnblockSignals just
> after sigsetjmp. I'm not sure if this is the best solution.
> Robert & myself had a discussion about the problem yesterday. We felt
> this is a genuine problem with the parallel worker error handling and
> need to be fixed.
>

Note that, in all sigsetjmp blocks, we intentionally
HOLD_INTERRUPTS(), to not cause any issues while performing error
handling, I'm concerned here that now, if we directly call
BackgroundWorkerUnblockSignals() which will open up all the signals
and our main intention of holding interrupts or block signals may go
away.

Since the main problem for this hang issue is because of blocking
SIGUSR1, in sigsetjmp, can't we just only unblock only the SIGUSR1,
instead of unblocking all signals? I tried this with parallel copy
hang, the issue is resolved.

Something like below -

if (sigsetjmp(local_sigjmp_buf, 1) != 0)
{
sigset_t x;
sigemptyset (&x);
sigaddset(&x, SIGUSR1);
sigprocmask(SIG_UNBLOCK, &x, NULL);

/* Since not using PG_TRY, must reset error stack by hand */
error_context_stack = NULL;

/* Prevent interrupts while cleaning up */
HOLD_INTERRUPTS();

If okay, with the above approach, we can put the above
sigprocmask(SIG_UNBLOCK,..) piece of code(of course generically to
unblock any given signal) in a macro similar to PG_SETMASK() and use
that in all the places wherever EmitErrorReport() is called from
sigsetjmp. We should mind the portability of sigprocmask as well.

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message tsunakawa.takay@fujitsu.com 2020-07-17 08:29:08 RE: Transactions involving multiple postgres foreign servers, take 2
Previous Message Fujii Masao 2020-07-17 07:25:58 Re: Is it useful to record whether plans are generic or custom?