Re: processes stuck in shutdown following OOM/recovery

From: Noah Misch <noah(at)leadboat(dot)com>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Martijn Wallet <martijn(at)dbre(dot)nu>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: processes stuck in shutdown following OOM/recovery
Date: 2024-10-25 18:17:01
Message-ID: 20241025181701.ed.nmisch@google.com
Lists: pgsql-hackers

At commit 779972e, I got about 50% "pg_ctl: server does not shut down" from
$SUBJECT with this loop:

nti=; while date; do PGCTLTIMEOUT=4 make check-tests TESTS=reindex_catalog PG_TEST_INITDB_EXTRA_OPTS='-cwal_level=minimal -cmax_wal_senders=0' NO_TEMP_INSTALL=$nti; nti=1; grep abnormal src/test/regress/log/postmaster.log; done

Your patch fixes that. (It came to my attention while developing commit
67bab53. At that commit or later, this recipe won't see the problem.) The
patch is a strict improvement, so I would be marking
https://commitfest.postgresql.org/50/4884/ Ready for Committer if it weren't
already in that state. That said, ...

On Thu, May 23, 2024 at 10:29:13AM +1200, Thomas Munro wrote:
> I'm also hoping to get review of the rather finickity state machine
> logic involved from people familiar with that; I think it's right, but
> I'd hate to break some other edge case...

... while I don't see an edge case it makes worse, I would fix the defect
differently, to cover more existing edge cases. I think a good standard is to
align with restart_after_crash=off behaviors. With restart_after_crash=off,
the postmaster exits with "shutting down because \"restart_after_crash\" is
off". Assume an external service manager then restarts it. Any deviations
between that experience and running with restart_after_crash=on should have a
specific reason. So post-OOM recovery should mirror original crash recovery
of a fresh postmaster.

List of behaviors the postmaster FatalError variable controls:

arms "issuing %s to recalcitrant children"
disarms "terminating any other active server processes"
changes CAC_STARTUP to CAC_RECOVERY
makes PM_WAIT_BACKENDS assume checkpointer already got SIGQUIT [causes $SUBJECT hang]
exit(1), as opposed to exit(0)
LOG "abnormal database system shutdown"
"all server processes terminated; reinitializing"
disarms maybe_start_bgworkers

Once we launch a new startup process for post-OOM recovery, I think we want
all of those behaviors to cease. In particular, disarming
maybe_start_bgworkers() and disarming "terminating any other active server
processes" are bad at that point. (And the $SUBJECT hang is bad, of course.)
What do you think? The rest are more cosmetic. (We don't document the exit
status, and pg_ctl ignores it. CAC_STARTUP gives me the most ambivalence, but
its cosmetics don't matter too much.) Disabling all those entails essentially
the following, though it would deserve comment edits and removal of the later
FatalError=false:

--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -3144,8 +3144,9 @@ PostmasterStateMachine(void)
 		StartupPID = StartChildProcess(B_STARTUP);
 		Assert(StartupPID != 0);
 		StartupStatus = STARTUP_RUNNING;
 		SetPmState(PM_STARTUP);
+		FatalError = false;
 		/* crash recovery started, reset SIGKILL flag */
 		AbortStartTime = 0;

Your patch eliminates all hangs for me, but about half the iterations of the
above test command still get exit(1) and the "abnormal" message. That could
be fine. Perhaps that matches what would happen if the fast shutdown arrived
earlier, after the postmaster observed the crash and before the last backend
reacted to SIGQUIT. Still, I'd lean toward clearing FatalError once we launch
the startup process. Again, two other behaviors it controls feel bad at that
point, and none of the behaviors it controls feel clearly-good at that point.

> --- a/src/backend/postmaster/postmaster.c
> +++ b/src/backend/postmaster/postmaster.c
> @@ -3748,12 +3748,14 @@ PostmasterStateMachine(void)
>  			WalSummarizerPID == 0 &&
>  			BgWriterPID == 0 &&
>  			(CheckpointerPID == 0 ||
> -			 (!FatalError && Shutdown < ImmediateShutdown)) &&
> +			 (!FatalError && Shutdown < ImmediateShutdown) ||
> +			 (FatalError && CheckpointerPID != 0)) &&

I think that's equivalent to:

(CheckpointerPID == 0 || FatalError || Shutdown < ImmediateShutdown)

Separable topic: this related comment and code are already out of date:

	/*
	 * Unexpected exit of startup process (including FATAL exit)
	 * during PM_STARTUP is treated as catastrophic. There are no
	 * other processes running yet, so we can just exit.
	 */
	if (pmState == PM_STARTUP &&
		StartupStatus != STARTUP_SIGNALED &&
		!EXIT_STATUS_0(exitstatus))
	{
		LogChildExit(LOG, _("startup process"),
					 pid, exitstatus);
		ereport(LOG,
				(errmsg("aborting startup due to startup process failure")));
		ExitPostmaster(1);
	}

There are other children, like the checkpointer. Our normal practice would be
to SIGQUIT them and wait. (These days, the children notice postmaster exit
quickly, so it doesn't come up much.)

Thanks,
nm
