pgsql: Fix postmaster state machine to handle dead_end child crashes be

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-committers(at)lists(dot)postgresql(dot)org
Subject: pgsql: Fix postmaster state machine to handle dead_end child crashes be
Date: 2019-08-26 19:59:51
Message-ID: E1i2L9b-0003ln-HE@gemulon.postgresql.org
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-committers

Fix postmaster state machine to handle dead_end child crashes better.

A report from Alvaro Herrera shows that if we're in PM_STARTUP
state, and we spawn a dead_end child to reject some incoming
connection request, and that child dies with an unexpected exit
code, the postmaster does not respond well. We correctly send
SIGQUIT to the startup process, but then:

* if the startup process exits with nonzero exit code, as expected,
we thought that that indicated a crash and aborted startup.

* if the startup process exits with zero exit code, which is possible
due to the inherent race condition, we'd advance to PM_RUN state
which is fine --- but the code forgot that AbortStartTime would be
nonzero in this situation. We'd either die on the Asserts saying
that it was zero, or perhaps misbehave later on. (A quick look
suggests that the only misbehavior might be busy-waiting due to
DetermineSleepTime doing the wrong thing.)

To fix the first point, adjust the state-machine logic to recognize
that a nonzero exit code is expected after sending SIGQUIT, and have
it transition to a state where we can restart the startup process.
To fix the second point, change the Asserts to clear the variable
rather than just claiming it should be clear already.

Perhaps we could improve this further by not treating a crash of
a dead_end child as a reason for panic'ing the database. However,
since those child processes are connected to shared memory, that
seems a bit risky. There are few good reasons for a dead_end child
to report failure anyway (the cause of this in Alvaro's report is
quite unclear). On balance, therefore, a minimal fix seems best.

This is an oversight in commit 45811be94. While that was back-patched,
I'm hesitant to back-patch this change. The lack of reasons for a
dead_end child to fail suggests that the case should be very rare in
the field, which squares with the lack of reports; so it seems like
this might not be worth the risk of introducing new issues. In any
case we can let it bake awhile in HEAD before considering a back-patch.

Discussion: https://postgr.es/m/20190615160950.GA31378@alvherre.pgsql

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/ee3278239550ff0ec9df72dd798a480c82c2b723

Modified Files
--------------
src/backend/postmaster/postmaster.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)

Browse pgsql-committers by date

  From Date Subject
Next Message Tom Lane 2019-08-26 21:03:08 pgsql: Fix 007_sync_rep.pl to notice failures in ALTER SYSTEM SET.
Previous Message Peter Eisentraut 2019-08-26 19:41:23 Re: pgsql: Fix gettext triggers specification