Re: Reorder shutdown sequence, to flush pgstats later

From: Andres Freund <andres(at)anarazel(dot)de>
To: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>
Subject: Re: Reorder shutdown sequence, to flush pgstats later
Date: 2025-01-24 20:06:17
Message-ID: ljyhozcxutbva5sx56xdyqr7l5wgb7lvrp6nqhf2dmvqbmbbui@5nui55sscaax
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

Attached is a new version of this patchset:

- HandleFatalError now doesn't expect to be called if already in FatalError or
ImmediateShutdown and contains a comment and assertions to that effect

- Fatal errors when in pmState > PM_WAIT_BACKENDS go to PM_WAIT_DEAD_END
directly. The previous version had them go to PM_WAIT_BACKENDS, to save
duplicating one line. The ConfigurePostmasterWaitSet(false) call is now
duplicated, with a comment at both sites mentioning that.

- Typo, comment and commit message fixes/polishing.

Unless somebody argues against, I'm planning to push all but the last later
today, wait for the buildfarm to settle, and then push the last. This is a
dependency of multiple other things, so it'd be good to get it in soon.

On 2025-01-15 08:38:33 +0000, Bertrand Drouvot wrote:
> On Tue, Jan 14, 2025 at 06:55:03PM -0500, Andres Freund wrote:
> > On 2025-01-14 13:06:31 +0000, Bertrand Drouvot wrote:
> > > === 4
> > >
> > > + * XXX: Is it worth inventing a different PMQUIT value
> > > + * that signals that the cluster is in a bad state,
> > > + * without a process having crashed?
> > > */
> > >
> > > That would be more "accurate". Something like PMQUIT_FORK_FAILURE or such.
> >
> > For now I think I'll leave it as such, as we'd need to have a new message in
> > quickdie(). While the primary message for PMQUIT_FOR_CRASH message would
> > differ, the rest seems like it'd largely be duplicated:
> >
> > case PMQUIT_FOR_CRASH:
> > /* A crash-and-restart cycle is in progress */
> > ereport(WARNING_CLIENT_ONLY,
> > (errcode(ERRCODE_CRASH_SHUTDOWN),
> > errmsg("terminating connection because of crash of another server process"),
> > errdetail("The postmaster has commanded this server process to roll back"
> > " the current transaction and exit, because another"
> > " server process exited abnormally and possibly corrupted"
> > " shared memory."),
> > errhint("In a moment you should be able to reconnect to the"
> > " database and repeat your command.")));
> > break;
> >
> > That seems like a separate change. Particularly because I'm just working on
> > this as part of some nasty three layer deep dependency chain - aio is pretty
> > big on its own...
>
> Fair enough. I'll look at the remaining pieces once this stuff goes in.

Cool. I did try writing the change, but it does indeed seem better as a
separate patch. Besides the error message, it also seems we ought to invent a
different ERRCODE, as neither ERRCODE_ADMIN_SHUTDOWN nor
ERRCODE_CRASH_SHUTDOWN seem appropriate.

Vaguely related and not at all crucial:

quickdie(), in the PMQUIT_FOR_CRASH, does
errhint("In a moment you should be able to reconnect to the"
" database and repeat your command.")));

but in case of an immediate *restart* (via pg_ctl restart -m immediate) we'll
not have such hint. postmaster.c doesn't know it's an immediate restart, so
that's not surprising, but it still seems a bit weird. One would hope that
immediate restarts are more common than crashes...

>
> > > But that would also need to call TerminateChildren() (a second time) again after
> > > ConfigurePostmasterWaitSet() which is not ideal though (unless we move the
> > > TerminateChildren() calls in the switch or add a check on PM_WAIT_DEAD_END in
> > > the first call).
> >
> > Why would it need to?
>
> Because I thought that TerminateChildren() needs to be called after
> ConfigurePostmasterWaitSet(false). If not, could new connections be accepted in
> the window between TerminateChildren() and ConfigurePostmasterWaitSet(false)?

I don't think that can happen with the attached patch - the
ConfigurePostmasterWaitSet() is called before HandleFatalError() returns, so
no new connections can be accepted, as that happens only from
ServerLoop()->AcceptConnection().

> > > === 8
> > >
> > > + * Now that everyone important is gone
> > >
> > > s/everyone important is/walsenders and archiver are also gone/?
> >
> > I'd rather get away from these lists of specific processes - they end up being
> > noisy in the git history because the lines get updated to add/remove items (or
> > updates to them get missed).
>
> Good point, that's more "difficult" to maintain. Though I'm not sure that
> "important" is the right wording. "Now that non dead processes have finished"
> maybe? Could be seen as a Nit too.

This specific case I just solved by referencing "the processes mentioned
above"...

Greetings,

Andres Freund

Attachment Content-Type Size
v5-0001-checkpointer-Request-checkpoint-via-latch-instead.patch text/x-diff 4.8 KB
v5-0002-postmaster-Don-t-open-code-TerminateChildren-in-H.patch text/x-diff 3.5 KB
v5-0003-postmaster-Don-t-repeatedly-transition-to-crashin.patch text/x-diff 2.7 KB
v5-0004-postmaster-Move-code-to-switch-into-FatalError-st.patch text/x-diff 4.4 KB
v5-0005-postmaster-Commonalize-FatalError-paths.patch text/x-diff 4.5 KB
v5-0006-postmaster-Adjust-which-processes-we-expect-to-ha.patch text/x-diff 2.9 KB
v5-0007-Change-shutdown-sequence-to-terminate-checkpointe.patch text/x-diff 18.2 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2025-01-24 20:07:42 Re: Windows: openssl & gssapi dislike each other
Previous Message Melanie Plageman 2025-01-24 20:02:21 Re: Eagerly scan all-visible pages to amortize aggressive vacuum