Re: Refactoring postmaster's code to cleanup after child exit

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Refactoring postmaster's code to cleanup after child exit
Date: 2024-08-08 10:47:42
Message-ID: CA+hUKGKWqSFoJXzRC9D0owFj1Xby0qvWr-k6p8z9gQzT7+nr7w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Aug 2, 2024 at 11:57 AM Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
> * v3-0001-Make-BackgroundWorkerList-doubly-linked.patch

LGTM.

> [v3-0002-Refactor-code-to-handle-death-of-a-backend-or-bgw.patch]

Currently, when a child process exits, the postmaster first scans
through BackgroundWorkerList, to see if it the child process was a
background worker. If not found, then it scans through BackendList to
see if it was a regular backend. That leads to some duplication
between the bgworker and regular backend cleanup code, as both have an
entry in the BackendList that needs to be cleaned up in the same way.
Refactor that so that we scan just the BackendList to find the child
process, and if it was a background worker, do the additional
bgworker-specific cleanup in addition to the normal Backend cleanup.

Makes sense.

On Windows, if a child process exits with ERROR_WAIT_NO_CHILDREN, it's
now logged with that exit code, instead of 0. Also, if a bgworker
exits with ERROR_WAIT_NO_CHILDREN, it's now treated as crashed and is
restarted. Previously it was treated as a normal exit.

Interesting. So when that error was first specially handled in this thread:

https://www.postgresql.org/message-id/flat/AANLkTimCTkNKKrHCd3Ot6kAsrSS7SeDpOTcaLsEP7i%2BM%40mail.gmail.com#41f60947571b75377f04af67ba6baf40

... it went from being considered a crash, to being considered like
exit(0). It's true that shared memory can't be corrupted by a process
that never enters main(), but it's better not to hide the true reason
for the failure (if it is still possible -- I don't find many
references to that phenomenon in recent times). Clobbering exitstatus
with 0 doesn't seem right at all, now that we have background workers
whose restart behaviour is affected by that.

If a child process is not found in the BackendList, the log message
now calls it "untracked child process" rather than "server process".
Arguably that should be a PANIC, because we do track all the child
processes in the list, so failing to find a child process is highly
unexpected. But if we want to change that, let's discuss and do that
as a separate commit.

Yeah, it would be highly unexpected if waitpid() told you about some
random other process (or we screwed up the bookkeeping and didn't
recognise it). So at least having a different message seems good.

> * v3-0003-Fix-comment-on-processes-being-kept-over-a-restar.patch

+1

> * v3-0004-Consolidate-postmaster-code-to-launch-background-.patch

Much of the code in process_pm_child_exit() to launch replacement
processes when one exits or when progressing to next postmaster state
was unnecessary, because the ServerLoop will launch any missing
background processes anyway. Remove the redundant code and let
ServerLoop handle it.

+1, makes sense.

In ServerLoop, move the code to launch all the processes to a new
subroutine, to group it all together.

+1, makes sense.

More soon...

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Shlok Kyal 2024-08-08 10:54:22 Re: long-standing data loss bug in initial sync of logical replication
Previous Message Peter Eisentraut 2024-08-08 10:11:38 Re: Enable data checksums by default