Re: Reorder shutdown sequence, to flush pgstats later

From: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Reorder shutdown sequence, to flush pgstats later
Date: 2025-01-13 12:20:39
Message-ID: Z4UFFw09vN404Uh+@ip-10-97-1-34.eu-west-3.compute.internal

Hi,

On Fri, Jan 10, 2025 at 02:07:25PM -0500, Andres Freund wrote:
> However, I think 0007 needs a bit more work.
>
> One thing that worries me a bit is that using SIGINT for triggering the
> shutdown checkpoint could somehow be unintentionally triggered? Previously
> checkpointer would effectively have ignored that,

Right.

> but now it'd bring the whole
> cluster down (because postmaster would see an unexpected ShutdownXLOG()). But
> I can't really see a legitimate reason for checkpointer to get spurious
> SIGINTs.

Yeah, apart from human error on the OS side, I don't see any reason either.

> This made me think about what happens if a spurious SIGINT is sent - and in
> the last version the answer wasn't great, namely everything seemed to go on
> normal, except that the cluster couldn't be shut down normally.

Yeah, and I think that checkpoints would hang.

> There wasn't
> any code to treat receiving PMSIGNAL_XLOG_IS_SHUTDOWN in the wrong state as
> fatal. I think this needs to be treated the same way we treat not being able
> to fork checkpointer during a shutdown. Now receiving
> PMSIGNAL_XLOG_IS_SHUTDOWN while not in PM_WAIT_XLOG_SHUTDOWN triggers
> FatalError processing after logging "WAL was shut down unexpectedly".

I wonder if we could first ask the postmaster to confirm that the SIGINT is
coming from it (and, if not, simply ignore the SIGINT). I thought about a
shared memory flag, but the postmaster has to be kept away from shared memory
operations, so that's not a valid idea. Another idea could be for the checkpointer
to send a new "confirm" request to the postmaster first, but then I'm not sure how
the postmaster could reply (given the signals that are already being used),
and that might overcomplicate things.
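
Just to illustrate what "checking the sender" could look like, here is a quick,
untested sketch using SA_SIGINFO to compare the sender's PID with PostmasterPid
(assuming PostmasterPid is visible to the checkpointer). I'm not sure we'd want
to rely on siginfo for this though (it would not help on Windows, where signals
are emulated), so it's really just to show the idea:

/*
 * Untested sketch only: install the checkpointer's SIGINT handler with
 * SA_SIGINFO, so that we can check that the signal really comes from the
 * postmaster before treating it as a shutdown-checkpoint request.
 */
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t shutdown_requested = 0;

static void
checkpointer_sigint_handler(int signo, siginfo_t *info, void *ucontext)
{
	/* Only honor the request if it was sent by our parent, the postmaster. */
	if (info->si_pid == PostmasterPid)
		shutdown_requested = 1;
	/* otherwise: spurious SIGINT, silently ignore it */
}

static void
install_checkpointer_sigint_handler(void)
{
	struct sigaction act;

	memset(&act, 0, sizeof(act));
	act.sa_sigaction = checkpointer_sigint_handler;
	act.sa_flags = SA_SIGINFO;
	sigemptyset(&act.sa_mask);
	sigaction(SIGINT, &act, NULL);
}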

So yeah, it might be better to be on the safe side of things and "simply" enter FatalError processing.

> We have
> multiple copies of code to go to FatalError, that doesn't seem great.

+ * FIXME: This should probably not have a copy fo the code to
+ * transition into FatalError state.
+ */
+ ereport(LOG,
+ (errmsg("WAL was shut down unexpectedly")));

Indeed, it might be worth extracting this into a helper function.
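
Something like the below, maybe (rough sketch only, the helper name is invented;
it just gathers what the existing fork-failure path already does):

/*
 * Rough sketch, not code from the patch: centralize the transition into
 * FatalError processing, so that the fork-failure path and the new
 * "WAL was shut down unexpectedly" path don't each carry a copy of it.
 */
static void
HandleFatalError(void)
{
	/* Ensure an "abnormal shutdown" message gets logged when we exit. */
	FatalError = true;

	/* Stop any dead-end children and stop creating new ones. */
	UpdatePMState(PM_WAIT_DEAD_END);
	ConfigurePostmasterWaitSet(false);
	SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
}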

> Another thing I started worrying about is that the commit added
> PM_WAIT_CHECKPOINTER *after* PM_WAIT_DEAD_END. However, we have a few places
> where we directly jump to PM_WAIT_DEAD_END in fatal situations:
>
> /*
> * Stop any dead-end children and stop creating new ones.
> */
> UpdatePMState(PM_WAIT_DEAD_END);
> ConfigurePostmasterWaitSet(false);
> SignalChildren(SIGQUIT, btmask(B_DEAD_END_BACKEND));
> and
> /*
> * If we failed to fork a checkpointer, just shut down.
> * Any required cleanup will happen at next restart. We
> * set FatalError so that an "abnormal shutdown" message
> * gets logged when we exit.
> *
> * We don't consult send_abort_for_crash here, as it's
> * unlikely that dumping cores would illuminate the reason
> * for checkpointer fork failure.
> */
> FatalError = true;
> UpdatePMState(PM_WAIT_DEAD_END);
> ConfigurePostmasterWaitSet(false);
>
>
> I don't think this actively causes problems today,

For the first case, we'd ask the checkpointer to do work in PM_WAIT_DEAD_END
while it has already been SIGQUIT'd (through SignalHandlerForCrashExit()). But
that's not an issue per se, as we check for CheckpointerPMChild != NULL when
pmState == PM_WAIT_DEAD_END.

For the second case (where we failed to fork a checkpointer), we also
check for CheckpointerPMChild != NULL when pmState == PM_WAIT_DEAD_END.

So in both cases we'd be asking a non-existing checkpointer to do work, but since
we later check that it exists, I also don't think that it causes any problem.
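
To make it concrete, the shape of the check is something like this (simplified
sketch, not the exact 0007 code):

	if (pmState == PM_WAIT_DEAD_END)
	{
		if (CheckpointerPMChild != NULL)
		{
			/* checkpointer still exists: it can do the remaining work */
		}
		else
		{
			/* crash exit or failed fork: nobody to ask, don't wait for it */
		}
	}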

> but it seems rather iffy.

Yeah, we'd be asking a non-existing checkpointer process to do work...

> Maybe PM_WAIT_CHECKPOINTER should be before PM_WAIT_DEAD_END? I earlier
> thought it'd be better to shut checkpointer down after even dead-end children
> exited, in case we ever wanted to introduce stats for dead-end children - but
> that seems rather unlikely?

Yeah, given the above, it looks cleaner to change that ordering.

Indeed, it seems unlikely that we'd ever introduce stats for dead-end children,
and even if we did, we could move PM_WAIT_CHECKPOINTER back after PM_WAIT_DEAD_END
(or find another way to ensure the stats are flushed after the dead-end children
have shut down). So it's probably better to focus on the current state of things
and put PM_WAIT_CHECKPOINTER before PM_WAIT_DEAD_END.

Anyway, if we reach the two cases mentioned above we're not going to flush the
stats anyway (CheckpointerPMChild is NULL), so better to reorder.
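
So, in PMState terms, the ordering being discussed here would roughly be
(sketch only, intermediate states elided):

	/* ... */
	PM_WAIT_XLOG_SHUTDOWN,		/* checkpointer writes the shutdown checkpoint */
	/* ... */
	PM_WAIT_CHECKPOINTER,		/* checkpointer flushes pgstats and exits */
	PM_WAIT_DEAD_END,			/* dead-end children exit */
	PM_NO_CHILDREN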

> Independent of this patch, we seem to deal with FatalError processing in a
> somewhat inconsistently. We have this comment (in master):
> /*
> * PM_WAIT_DEAD_END state ends when all other children are gone except
> * for the logger. During normal shutdown, all that remains are
> * dead-end backends, but in FatalError processing we jump straight
> * here with more processes remaining. Note that they have already
> * been sent appropriate shutdown signals, either during a normal
> * state transition leading up to PM_WAIT_DEAD_END, or during
> * FatalError processing.
>
> However that doesn't actually appear to be true in the most common way to
> reach FatalError, via HandleChildCrash():
>
> if (Shutdown != ImmediateShutdown)
> FatalError = true;
>
> /* We now transit into a state of waiting for children to die */
> if (pmState == PM_RECOVERY ||
> pmState == PM_HOT_STANDBY ||
> pmState == PM_RUN ||
> pmState == PM_STOP_BACKENDS ||
> pmState == PM_WAIT_XLOG_SHUTDOWN)
> UpdatePMState(PM_WAIT_BACKENDS);
>
> Here we
> a) don't directly go to PM_WAIT_DEAD_END, but PM_WAIT_BACKENDS
>
> From PM_WAIT_BACKENDS we'll switch to PM_WAIT_DEAD_END if all processes
> other than walsender, archiver and dead-end children exited.
>
> b) in PM_WAIT_XLOG_SHUTDOWN (previously named PM_SHUTDOWN) we go *back* to
> PM_WAIT_BACKENDS?
>
> This comment seems to have been added in
>
> commit a78af0427015449269fb7d9d8c6057cfcb740149
> Author: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi>
> Date: 2024-11-14 16:12:28 +0200
>
> Assign a child slot to every postmaster child process
>
> ISTM that section of the comment is largely wrong and has never really been
> the case if my git sleuthing is correct?

I agree that this does not look correct, but I don't think it's coming from a78af0427015449269fb7d9d8c6057cfcb740149.
ISTM that a78af0427015449269fb7d9d8c6057cfcb740149 re-worded an already-wrong
comment:

"
/*
* PM_WAIT_DEAD_END state ends when the BackendList is entirely empty
* (ie, no dead_end children remain), and the archiver is gone too.
*
* The reason we wait for those two is to protect them against a new
* postmaster starting conflicting subprocesses; this isn't an
* ironclad protection, but it at least helps in the
* shutdown-and-immediately-restart scenario. Note that they have
* already been sent appropriate shutdown signals, either during a
* normal state transition leading up to PM_WAIT_DEAD_END, or during
* FatalError processing.
*/
"

while, at the same time, we could already see:

"
if (Shutdown != ImmediateShutdown)
FatalError = true;

/* We now transit into a state of waiting for children to die */
if (pmState == PM_RECOVERY ||
pmState == PM_HOT_STANDBY ||
pmState == PM_RUN ||
pmState == PM_STOP_BACKENDS ||
pmState == PM_SHUTDOWN)
pmState = PM_WAIT_BACKENDS;
"

It seems to me that it was introduced in e6a442c71b30f62e7b5eee6058afc961b1c7f29b,
where the above comment was added without aligning with what HandleChildCrash()
was already doing (i.e. already setting PM_WAIT_BACKENDS).

But anyway, I do agree that this comment looks wrong.

> We do have one place that directly switches to PM_WAIT_DEAD_END:
> /*
> * If we failed to fork a checkpointer, just shut down.
> * Any required cleanup will happen at next restart. We
> * set FatalError so that an "abnormal shutdown" message
> * gets logged when we exit.
> *
> * We don't consult send_abort_for_crash here, as it's
> * unlikely that dumping cores would illuminate the reason
> * for checkpointer fork failure.
> */
> FatalError = true;
> UpdatePMState(PM_WAIT_DEAD_END);
> ConfigurePostmasterWaitSet(false);
>
> but I suspect that's just about unreachable these days. If checkpointer exits
> outside of shutdown processing, it's treated as as a reason to crash-restart:
> /*
> * Any unexpected exit of the checkpointer (including FATAL
> * exit) is treated as a crash.
> */
> HandleChildCrash(pid, exitstatus,
> _("checkpointer process"));
>

I do agree. I was trying to reach this code path while replying to one of
the comments above (about the PM_WAIT_DEAD_END ordering), but I had to replace
the if condition with another one to be able to reach it during testing.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
