Re: Archiver not exiting upon crash

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Archiver not exiting upon crash
Date: 2012-05-23 20:10:35
Message-ID: 10932.1337803835@sss.pgh.pa.us
Lists: pgsql-hackers

Jeff Janes <jeff(dot)janes(at)gmail(dot)com> writes:
> It looks to me like the SIGQUIT from the postmaster is simply getting
> lost. And from what little I understand of signal handling, this is a
> known race with system(3). The archive_command, child of archiver,
> exits before it can receive the signal sent to the entire archiver
> process group, so it doesn't set its exit status to show it was
> signalled. But the signal sent directly to the archiver reaches it
> while it is still ignoring SIGQUITs.

Ugh.

> If the SIGQUIT is getting lost in a race, could it just be blocked
> during the system(3) call?
> I don't know what happens if you call system(3) with SIGQUIT being blocked.

On my machine, man system(3) saith:

    system() ignores the SIGINT and SIGQUIT signals, and blocks the
    SIGCHLD signal, while waiting for the command to terminate. If this
    might cause the application to miss a signal that would have killed
    it, the application should examine the return value from system() and
    take whatever action is appropriate to the application if the command
    terminated due to receipt of a signal.

Now, the code that directly calls system(), namely pgarch_archiveXlog(),
knows this perfectly well, as per the comment at lines 590ff in HEAD.
However, the code that *calls* it did not get the memo :-(, and appears
to be willing to retry regardless.

> Or maybe the postmaster should not be infinitely patient, but send
> another round of signals after a brief delay.

If the first one was ignored, later ones might be too.

I'm inclined to think that we should change pgarch_archiveXlog to
detect these specific signal conditions and just directly exit(),
rather than giving its caller a chance to blow the decision.
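
A minimal sketch of that idea, in isolation from the PostgreSQL sources: inspect the status returned by system() and exit on the spot if the command appears to have died from SIGINT or SIGQUIT, so that no caller gets a chance to retry. The function names below are hypothetical and are not the actual pgarch.c code. Treating a shell exit status above 128 as "128 + signal number" is an assumption about common shell behavior, not something system() guarantees.

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Hypothetical helper: returns 1 if a status from system() suggests the
 * command was stopped by one of the signals that system() ignores in the
 * caller (SIGINT, SIGQUIT), and 0 otherwise.
 */
static int
interrupted_by_ignored_signal(int status)
{
    int sig = 0;

    if (status == -1)
        return 0;               /* system() itself failed; not a signal case */

    if (WIFSIGNALED(status))
        sig = WTERMSIG(status); /* the shell itself was killed by a signal */
    else if (WIFEXITED(status) && WEXITSTATUS(status) > 128)
        sig = WEXITSTATUS(status) - 128;    /* assumed shell convention */

    return sig == SIGINT || sig == SIGQUIT;
}

/*
 * Hypothetical wrapper: run the command and exit immediately if it was
 * interrupted, instead of handing the decision back to a caller that
 * might simply retry. Otherwise return the raw wait status.
 */
static int
run_archive_command(const char *command)
{
    int status = system(command);

    if (interrupted_by_ignored_signal(status))
    {
        fprintf(stderr, "command was interrupted by a signal, exiting\n");
        exit(1);
    }
    return status;
}
```

A caller would still examine the returned status for ordinary failures (nonzero exit codes) and apply its usual retry policy to those; only the signal cases bypass it.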

regards, tom lane
