From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | Alvaro Herrera <alvherre(at)commandprompt(dot)com> |
Cc: | Pg Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: checkpointer code behaving strangely on postmaster -T |
Date: | 2012-05-11 21:44:50 |
Message-ID: | 17504.1336772690@sss.pgh.pa.us |
Lists: | pgsql-hackers |
Alvaro Herrera <alvherre(at)commandprompt(dot)com> writes:
> Excerpts from Tom Lane's message of Fri May 11 16:50:01 -0400 2012:
>> I'm confused about what you did here and whether this isn't just pilot
>> error.
> The sequence of events is:
> postmaster -T
> crash a backend
> SIGINT postmaster
> SIGCONT all child processes
> My expectation is that postmaster should exit normally after this.
Well, my expectation is that the postmaster should wait for the children
to finish dying, and then exit rather than respawn anything. It is not
on the postmaster's head to make them die anymore, because it already
(thinks it) sent them SIGQUIT. Using SIGCONT here is pilot error.
> Maybe we can consider this to be just pilot error, but then why do all
> other processes exit normally?
The reason for that is that the postmaster's SIGINT interrupt handler
(lines 2163ff) sent them SIGTERM, without bothering to notice that we'd
already sent them SIGQUIT/SIGSTOP; so once you CONT them they get the
SIGTERM and drop out normally. That handler knows it should not signal
the checkpointer yet, so the checkpointer doesn't get the memo. But the
lack of a FatalError check here is just a simplicity-of-implementation
thing; it should not be necessary to send any more signals once we are
in FatalError state. Besides, this behavior is all wrong for a crash
recovery scenario: there is no guarantee that shared memory is in good
enough condition for SIGTERM shutdown to work. And we *definitely*
don't want the checkpointer trying to write a shutdown checkpoint.
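
The signalling behaviour described above can be modelled in a few lines of C. This is only an illustrative sketch, not code from postmaster.c: the types `Child`, `ChildKind`, and `ModelSignal` and the functions `model_handle_crash` and `model_fast_shutdown` are hypothetical names invented for the example, and signals are merely recorded rather than delivered with kill(). It assumes that under -T every child, checkpointer included, is stopped when a backend crashes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for real signals; this model never calls kill(). */
typedef enum { MODEL_NONE, MODEL_SIGQUIT, MODEL_SIGSTOP, MODEL_SIGTERM } ModelSignal;

/* Hypothetical child roles; not the postmaster's real bookkeeping. */
typedef enum { CHILD_BACKEND, CHILD_BGWRITER, CHILD_CHECKPOINTER } ChildKind;

typedef struct {
    ChildKind   kind;
    ModelSignal last_signal;   /* most recent signal recorded for this child */
    int         signal_count;  /* how many signals were recorded in total */
} Child;

void record_signal(Child *c, ModelSignal sig)
{
    c->last_signal = sig;
    c->signal_count++;
}

/* A backend crashed: signal every child and enter the FatalError state.
 * send_stop models -T, where children are stopped instead of told to quit. */
void model_handle_crash(Child *children, size_t n, bool send_stop, bool *fatal_error)
{
    for (size_t i = 0; i < n; i++)
        record_signal(&children[i], send_stop ? MODEL_SIGSTOP : MODEL_SIGQUIT);
    *fatal_error = true;
}

/* Fast shutdown request (SIGINT to the postmaster). Returns signals sent.
 * check_fatal_error = false reproduces the behaviour described in the message;
 * true adds a guard so the handler sends nothing once a crash has happened. */
int model_fast_shutdown(Child *children, size_t n, bool fatal_error, bool check_fatal_error)
{
    int sent = 0;

    if (check_fatal_error && fatal_error)
        return 0;               /* every child was already signalled */

    for (size_t i = 0; i < n; i++) {
        if (children[i].kind == CHILD_CHECKPOINTER)
            continue;           /* checkpointer is signalled later, not here */
        record_signal(&children[i], MODEL_SIGTERM);
        sent++;
    }
    return sent;
}
```

With the guard off, a crash followed by a fast shutdown leaves a pending SIGTERM recorded for every child except the checkpointer, which matches the observation that the other processes exit once they are continued while the checkpointer does not. With the guard on, the shutdown request records nothing further.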
>> So I don't see any bug here. And, after closer inspection, your
>> previous proposed patch is quite bogus because checkpointer is not
>> supposed to stop yet when the other processes are being terminated
>> normally.
> Well, it does send the signal only when FatalError is set. So it should
> only affect -T behavior.
If FatalError is set, it should not be necessary to send any more
signals, period, because we already tried to kill every child. If we
need to defend against somebody using SIGSTOP/SIGCONT inappropriately,
it would take a lot more thought (and code) than this, and it would
still be extremely fragile because a SIGCONT'd backend is going to be
executing against possibly-corrupt shared memory.
regards, tom lane