From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com> |
Cc: | Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, pgsql-hackers(at)postgresql(dot)org, Nazir Bilal Yavuz <byavuz81(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com> |
Subject: | Re: Reorder shutdown sequence, to flush pgstats later |
Date: | 2025-01-24 20:06:17 |
Message-ID: | ljyhozcxutbva5sx56xdyqr7l5wgb7lvrp6nqhf2dmvqbmbbui@5nui55sscaax |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
Attached is a new version of this patchset:
- HandleFatalError now doesn't expect to be called if already in FatalError or
ImmediateShutdown and contains a comment and assertions to that effect
- Fatal errors when in pmState > PM_WAIT_BACKENDS go to PM_WAIT_DEAD_END
directly. The previous version had them go to PM_WAIT_BACKENDS, to save
duplicating one line. The ConfigurePostmasterWaitSet(false) call is now
duplicated, with a comment at both sites mentioning that.
- Typo, comment and commit message fixes/polishing.
Unless somebody argues against, I'm planning to push all but the last later
today, wait for the buildfarm to settle, and then push the last. This is a
dependency of multiple other things, so it'd be good to get it in soon.
On 2025-01-15 08:38:33 +0000, Bertrand Drouvot wrote:
> On Tue, Jan 14, 2025 at 06:55:03PM -0500, Andres Freund wrote:
> > On 2025-01-14 13:06:31 +0000, Bertrand Drouvot wrote:
> > > === 4
> > >
> > > + * XXX: Is it worth inventing a different PMQUIT value
> > > + * that signals that the cluster is in a bad state,
> > > + * without a process having crashed?
> > > */
> > >
> > > That would be more "accurate". Something like PMQUIT_FORK_FAILURE or such.
> >
> > For now I think I'll leave it as such, as we'd need to have a new message in
> > quickdie(). While the primary message for PMQUIT_FOR_CRASH message would
> > differ, the rest seems like it'd largely be duplicated:
> >
> > case PMQUIT_FOR_CRASH:
> > /* A crash-and-restart cycle is in progress */
> > ereport(WARNING_CLIENT_ONLY,
> > (errcode(ERRCODE_CRASH_SHUTDOWN),
> > errmsg("terminating connection because of crash of another server process"),
> > errdetail("The postmaster has commanded this server process to roll back"
> > " the current transaction and exit, because another"
> > " server process exited abnormally and possibly corrupted"
> > " shared memory."),
> > errhint("In a moment you should be able to reconnect to the"
> > " database and repeat your command.")));
> > break;
> >
> > That seems like a separate change. Particularly because I'm just working on
> > this as part of some nasty three layer deep dependency chain - aio is pretty
> > big on its own...
>
> Fair enough. I'll look at the remaining pieces once this stuff goes in.
Cool. I did try writing the change, but it does indeed seem better as a
separate patch. Besides the error message, it also seems we ought to invent a
different ERRCODE, as neither ERRCODE_ADMIN_SHUTDOWN nor
ERRCODE_CRASH_SHUTDOWN seem appropriate.
Vaguely related and not at all crucial:
quickdie(), in the PMQUIT_FOR_CRASH, does
errhint("In a moment you should be able to reconnect to the"
" database and repeat your command.")));
but in case of an immediate *restart* (via pg_ctl restart -m immediate) we'll
not have such hint. postmaster.c doesn't know it's an immediate restart, so
that's not surprising, but it still seems a bit weird. One would hope that
immediate restarts are more common than crashes...
>
> > > But that would also need to call TerminateChildren() (a second time) again after
> > > ConfigurePostmasterWaitSet() which is not ideal though (unless we move the
> > > TerminateChildren() calls in the switch or add a check on PM_WAIT_DEAD_END in
> > > the first call).
> >
> > Why would it need to?
>
> Because I thought that TerminateChildren() needs to be called after
> ConfigurePostmasterWaitSet(false). If not, could new connections be accepted in
> the window between TerminateChildren() and ConfigurePostmasterWaitSet(false)?
I don't think that can happen with the attached patch - the
ConfigurePostmasterWaitSet() is called before HandleFatalError() returns, so
no new connections can be accepted, as that happens only from
ServerLoop()->AcceptConnection().
> > > === 8
> > >
> > > + * Now that everyone important is gone
> > >
> > > s/everyone important is/walsenders and archiver are also gone/?
> >
> > I'd rather get away from these lists of specific processes - they end up being
> > noisy in the git history because the lines get updated to add/remove items (or
> > updates to them get missed).
>
> Good point, that's more "difficult" to maintain. Though I'm not sure that
> "important" is the right wording. "Now that non dead processes have finished"
> maybe? Could be seen as a Nit too.
This specific case I just solved by referencing "the processes mentioned
above"...
Greetings,
Andres Freund
Attachment | Content-Type | Size |
---|---|---|
v5-0001-checkpointer-Request-checkpoint-via-latch-instead.patch | text/x-diff | 4.8 KB |
v5-0002-postmaster-Don-t-open-code-TerminateChildren-in-H.patch | text/x-diff | 3.5 KB |
v5-0003-postmaster-Don-t-repeatedly-transition-to-crashin.patch | text/x-diff | 2.7 KB |
v5-0004-postmaster-Move-code-to-switch-into-FatalError-st.patch | text/x-diff | 4.4 KB |
v5-0005-postmaster-Commonalize-FatalError-paths.patch | text/x-diff | 4.5 KB |
v5-0006-postmaster-Adjust-which-processes-we-expect-to-ha.patch | text/x-diff | 2.9 KB |
v5-0007-Change-shutdown-sequence-to-terminate-checkpointe.patch | text/x-diff | 18.2 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Andrew Dunstan | 2025-01-24 20:07:42 | Re: Windows: openssl & gssapi dislike each other |
Previous Message | Melanie Plageman | 2025-01-24 20:02:21 | Re: Eagerly scan all-visible pages to amortize aggressive vacuum |