Re: Refactoring postmaster's code to cleanup after child exit

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Refactoring postmaster's code to cleanup after child exit
Date: 2024-09-06 13:13:43
Message-ID: d9269114-fb59-400a-ae32-3240d1acf61f@iki.fi
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 04/09/2024 17:35, Andres Freund wrote:
> On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote:
>> From dc53f89edbeec99f8633def8aa5f47cd98e7a150 Mon Sep 17 00:00:00 2001
>> From: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi>
>> Date: Mon, 12 Aug 2024 10:59:04 +0300
>> Subject: [PATCH v4 4/8] Introduce a separate BackendType for dead-end children
>>
>> And replace postmaster.c's own "backend type" codes with BackendType
>
> Hm - it seems a bit odd to open-code this when we actually have a "table
> driven configuration" available? Why isn't the type a field in
> child_process_kind?

Sorry, I didn't understand this. What exactly would you add to
child_process_kind? Where would you use it?

>> +/*
>> + * MaxLivePostmasterChildren
>> + *
>> + * This reports the number postmaster child processes that can be active. It
>> + * includes all children except for dead_end children. This allows the array
>> + * in shared memory (PMChildFlags) to have a fixed maximum size.
>> + */
>> +int
>> +MaxLivePostmasterChildren(void)
>> +{
>> + int n = 0;
>> +
>> + /* We know exactly how mamy worker and aux processes can be active */
>> + n += autovacuum_max_workers;
>> + n += max_worker_processes;
>> + n += NUM_AUXILIARY_PROCS;
>> +
>> + /*
>> + * We allow more connections here than we can have backends because some
>> + * might still be authenticating; they might fail auth, or some existing
>> + * backend might exit before the auth cycle is completed. The exact
>> + * MaxBackends limit is enforced when a new backend tries to join the
>> + * shared-inval backend array.
>> + */
>> + n += 2 * (MaxConnections + max_wal_senders);
>> +
>> + return n;
>> +}
>
> I wonder if we could instead maintain at least some of this in
> child_process_kinds? Manually listing different types of processes in
> different places doesn't seem particularly sustainable.

Hmm, you mean adding "max this kind of children" field to
child_process_kinds? Perhaps.

>
>> +/*
>> + * Initialize at postmaster startup
>> + */
>> +void
>> +InitPostmasterChildSlots(void)
>> +{
>> + int num_pmchild_slots;
>> + int slotno;
>> + PMChild *slots;
>> +
>> + dlist_init(&freeBackendList);
>> + dlist_init(&freeAutoVacWorkerList);
>> + dlist_init(&freeBgWorkerList);
>> + dlist_init(&freeAuxList);
>> + dlist_init(&ActiveChildList);
>> +
>> + num_pmchild_slots = MaxLivePostmasterChildren();
>> +
>> + slots = palloc(num_pmchild_slots * sizeof(PMChild));
>> +
>> + slotno = 0;
>> + for (int i = 0; i < 2 * (MaxConnections + max_wal_senders); i++)
>> + {
>> + init_slot(&slots[slotno], slotno, &freeBackendList);
>> + slotno++;
>> + }
>> + for (int i = 0; i < autovacuum_max_workers; i++)
>> + {
>> + init_slot(&slots[slotno], slotno, &freeAutoVacWorkerList);
>> + slotno++;
>> + }
>> + for (int i = 0; i < max_worker_processes; i++)
>> + {
>> + init_slot(&slots[slotno], slotno, &freeBgWorkerList);
>> + slotno++;
>> + }
>> + for (int i = 0; i < NUM_AUXILIARY_PROCS; i++)
>> + {
>> + init_slot(&slots[slotno], slotno, &freeAuxList);
>> + slotno++;
>> + }
>> + Assert(slotno == num_pmchild_slots);
>> +}
>
> Along the same vein - could we generalize this into one array of different
> slot types and then loop over that to initialize / acquire the slots?

Makes sense.

>> +/* Return the appropriate free-list for the given backend type */
>> +static dlist_head *
>> +GetFreeList(BackendType btype)
>> +{
>> + switch (btype)
>> + {
>> + case B_BACKEND:
>> + case B_BG_WORKER:
>> + case B_WAL_SENDER:
>> + case B_SLOTSYNC_WORKER:
>> + return &freeBackendList;
>
> Maybe a daft question - but why are all of these in the same list? Sure,
> they're essentially all full backends, but they're covered by different GUCs?

No reason. No particular reason they should *not* share the same list
either though.

>
>> + /*
>> + * Auxiliary processes. There can be only one of each of these
>> + * running at a time.
>> + */
>> + case B_AUTOVAC_LAUNCHER:
>> + case B_ARCHIVER:
>> + case B_BG_WRITER:
>> + case B_CHECKPOINTER:
>> + case B_STARTUP:
>> + case B_WAL_RECEIVER:
>> + case B_WAL_SUMMARIZER:
>> + case B_WAL_WRITER:
>> + return &freeAuxList;
>> +
>> + /*
>> + * Logger is not connected to shared memory, and does not have a
>> + * PGPROC entry, but we still allocate a child slot for it.
>> + */
>
> Tangential: Why do we need a freelist for these and why do we choose a random
> pgproc for these instead of assigning one statically?
>
> Background: I'd like to not provide AIO workers with "bounce buffers" (for IO
> of buffers that can't be done in-place, like writes when checksums are
> enabled). The varying proc numbers make that harder than it'd have to be...

Yeah, we can make these fixed.Currently, the # of slots reserved for aux
processes is sized by NUM_AUXILIARY_PROCS, which is one smaller than the
number of different aux proces kinds:

> /*
> * We set aside some extra PGPROC structures for auxiliary processes,
> * ie things that aren't full-fledged backends but need shmem access.
> *
> * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
> * run during normal operation. Startup process and WAL receiver also consume
> * 2 slots, but WAL writer is launched only after startup has exited, so we
> * only need 6 slots.
> */
> #define NUM_AUXILIARY_PROCS 6

For PMChildSlot numbers, we could certainly just allocate one more slot.

It would probably make sense for PGPROCs too, even though PGPROC is a
much larger struct.

>> +PMChild *
>> +AssignPostmasterChildSlot(BackendType btype)
>> +{
>> + dlist_head *freelist;
>> + PMChild *pmchild;
>> +
>> + freelist = GetFreeList(btype);
>> +
>> + if (dlist_is_empty(freelist))
>> + return NULL;
>> +
>> + pmchild = dlist_container(PMChild, elem, dlist_pop_head_node(freelist));
>> + pmchild->pid = 0;
>> + pmchild->bkend_type = btype;
>> + pmchild->rw = NULL;
>> + pmchild->bgworker_notify = true;
>> +
>> + /*
>> + * pmchild->child_slot for each entry was initialized when the array of
>> + * slots was allocated.
>> + */
>> +
>> + dlist_push_head(&ActiveChildList, &pmchild->elem);
>> +
>> + ReservePostmasterChildSlot(pmchild->child_slot);
>> +
>> + /* FIXME: find a more elegant way to pass this */
>> + MyPMChildSlot = pmchild->child_slot;
>
> What if we assigned one offset for each process and assigned its ID here and
> also used that for its ProcNumber - that way we wouldn't need to manage
> freelists in two places.

It's currently possible to have up to 2 * max_connections backends in
the authentication phase. We would have to change that behaviour, or
make the PGPROC array 2x larger.

It might well be worth it, I don't know how sensible the current
behaviour is. But I'd like to punt that to later patch, to keep the
scope of this patch set reasonable. It's pretty straightforward to do
later on top of this if we want to.

>> +PMChild *
>> +FindPostmasterChildByPid(int pid)
>> +{
>> + dlist_iter iter;
>> +
>> + dlist_foreach(iter, &ActiveChildList)
>> + {
>> + PMChild *bp = dlist_container(PMChild, elem, iter.cur);
>> +
>> + if (bp->pid == pid)
>> + return bp;
>> + }
>> + return NULL;
>> +}
>
> It's not new, but it's quite sad that postmaster's process exit handling is
> effectively O(Backends^2)...

It would be straightforward to turn ActiveChildList into a hash table.
But I'd like to leave that to a followup patch too.

>> /*
>> * We're ready to rock and roll...
>> */
>> - StartupPID = StartChildProcess(B_STARTUP);
>> - Assert(StartupPID != 0);
>> + StartupPMChild = StartChildProcess(B_STARTUP);
>> + Assert(StartupPMChild != NULL);
>
> This (not new) assertion is ... odd.

Yeah, it's an assertion because StartChildProcess has this:

> /*
> * fork failure is fatal during startup, but there's no need to choke
> * immediately if starting other child types fails.
> */
> if (type == B_STARTUP)
> ExitPostmaster(1);

>> @@ -1961,26 +1915,6 @@ process_pm_reload_request(void)
>> (errmsg("received SIGHUP, reloading configuration files")));
>> ProcessConfigFile(PGC_SIGHUP);
>> SignalSomeChildren(SIGHUP, BACKEND_TYPE_ALL & ~(1 << B_DEAD_END_BACKEND));
>> - if (StartupPID != 0)
>> - signal_child(StartupPID, SIGHUP);
>> - if (BgWriterPID != 0)
>> - signal_child(BgWriterPID, SIGHUP);
>> - if (CheckpointerPID != 0)
>> - signal_child(CheckpointerPID, SIGHUP);
>> - if (WalWriterPID != 0)
>> - signal_child(WalWriterPID, SIGHUP);
>> - if (WalReceiverPID != 0)
>> - signal_child(WalReceiverPID, SIGHUP);
>> - if (WalSummarizerPID != 0)
>> - signal_child(WalSummarizerPID, SIGHUP);
>> - if (AutoVacPID != 0)
>> - signal_child(AutoVacPID, SIGHUP);
>> - if (PgArchPID != 0)
>> - signal_child(PgArchPID, SIGHUP);
>> - if (SysLoggerPID != 0)
>> - signal_child(SysLoggerPID, SIGHUP);
>> - if (SlotSyncWorkerPID != 0)
>> - signal_child(SlotSyncWorkerPID, SIGHUP);
>>
>> /* Reload authentication config files too */
>> if (!load_hba())
>
> For a moment I wondered why this change was part of this commit - but I guess
> we didn't have any of these in an array/list before this change...

Correct.

>> @@ -2469,11 +2410,15 @@ process_pm_child_exit(void)
>> }
>>
>> /* Was it the system logger? If so, try to start a new one */
>> - if (pid == SysLoggerPID)
>> + if (SysLoggerPMChild && pid == SysLoggerPMChild->pid)
>> {
>> - SysLoggerPID = 0;
>> /* for safety's sake, launch new logger *first* */
>> - SysLoggerPID = SysLogger_Start();
>> + SysLoggerPMChild->pid = SysLogger_Start();
>> + if (SysLoggerPMChild->pid == 0)
>> + {
>> + FreePostmasterChildSlot(SysLoggerPMChild);
>> + SysLoggerPMChild = NULL;
>> + }
>> if (!EXIT_STATUS_0(exitstatus))
>> LogChildExit(LOG, _("system logger process"),
>
> Seems a bit weird to have one place with a different memory lifetime handling
> than other places. Why don't we just do this the same way as in other places
> but continue to defer the logging until after we tried to start the new
> logger?

Hmm, you mean let LaunchMissingBackgroundProcesses() handle the restart?

I'm a little scared of changing the existing logic. We don't have a
mechanism for deferring logging, so we would have to invent that, or the
logs would just accumulate in the pipe until syslogger starts up.
There's some code between here and LaunchMissingBackgroundProcesses(),
so might postmaster get blocked between writing to the syslogger pipe,
before having restarted it?

If forking the syslogger process fails, that can happen anyway, though.

> Might be worth having a test ensuring that loggers restart OK.

Yeah..

>> /* Construct a process name for log message */
>> +
>> + /*
>> + * FIXME: use GetBackendTypeDesc here? How does the localization of that
>> + * work?
>> + */
>> if (bp->bkend_type == B_DEAD_END_BACKEND)
>> {
>> procname = _("dead end backend");
>
> Might be worth having a version of GetBackendTypeDesc() that returns a
> translated string?

Constructing the string for background workers is a little more complicated:

snprintf(namebuf, MAXPGPATH, _("background worker \"%s\""),
bp->rw->rw_worker.bgw_type);

We could still do that for background workers and use the transalated
variant of GetBackendTypeDesc() for everything else though.

>> @@ -2697,9 +2643,16 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
>> {
>> dlist_iter iter;
>>
>> - dlist_foreach(iter, &BackendList)
>> + dlist_foreach(iter, &ActiveChildList)
>> {
>> - Backend *bp = dlist_container(Backend, elem, iter.cur);
>> + PMChild *bp = dlist_container(PMChild, elem, iter.cur);
>> +
>> + /* We do NOT restart the syslogger */
>> + if (bp == SysLoggerPMChild)
>> + continue;
>
> That comment seems a bit misleading - we do restart syslogger, we just don't
> do it here, no? I realize it's an old comment, but it still seems like it's
> worth fixing given that you touch all the code here...

No, we really do not restart the syslogger. This code runs when
*another* process has crashed unexpectedly. We kill all other processes,
reinitialize shared memory and restart, but the old syslogger keeps
running through all that.

I'll add a note on that to InitPostmasterChildSlots(), as it's a bit
surprising.

>> @@ -2871,29 +2786,27 @@ PostmasterStateMachine(void)
>> <snip>
>> - if (WalSummarizerPID != 0)
>> - signal_child(WalSummarizerPID, SIGTERM);
>> - if (SlotSyncWorkerPID != 0)
>> - signal_child(SlotSyncWorkerPID, SIGTERM);
>> + targetMask |= (1 << B_STARTUP);
>> + targetMask |= (1 << B_WAL_RECEIVER);
>> +
>> + targetMask |= (1 << B_WAL_SUMMARIZER);
>> + targetMask |= (1 << B_SLOTSYNC_WORKER);
>> /* checkpointer, archiver, stats, and syslogger may continue for now */
>>
>> + SignalSomeChildren(SIGTERM, targetMask);
>> +
>> /* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
>> pmState = PM_WAIT_BACKENDS;
>> <snip>
>
> It's likely the right thing to not do as one patch, but IMO this really wants
> to be a state table. Perhaps as part of child_process_kinds, perhaps separate
> from that.

Yeah. I've tried to refactor this into a table before, but didn't come
up with anything that I was happy with. I also feel there must be a
better way to organize this, but not sure what exactly. I hope that will
become more apparent after these other changes.

>> @@ -3130,8 +3047,21 @@ static void
>> LaunchMissingBackgroundProcesses(void)
>> {
>> /* Syslogger is active in all states */
>> - if (SysLoggerPID == 0 && Logging_collector)
>> - SysLoggerPID = SysLogger_Start();
>> + if (SysLoggerPMChild == NULL && Logging_collector)
>> + {
>> + SysLoggerPMChild = AssignPostmasterChildSlot(B_LOGGER);
>> + if (!SysLoggerPMChild)
>> + elog(LOG, "no postmaster child slot available for syslogger");
>
> How could this elog() be reached? Seems something seriously would have gone
> wrong to get here - in which case a LOG that might not even be visible (due to
> logger not working) doesn't seem like the right response.

I'll turn it into an assertion or PANIC.

>> @@ -3334,29 +3270,12 @@ SignalSomeChildren(int signal, uint32 targetMask)
>> static void
>> TerminateChildren(int signal)
>> {
>
> The comment for TerminateChildren() says "except syslogger and dead_end
> backends." - aren't you including the latter here?

The comment is adjusted in
v4-0004-Introduce-a-separate-BackendType-for-dead-end-chi.patch. Before
that, SignalChildren() does ignore dead-end children.

Thanks for the review!

--
Heikki Linnakangas
Neon (https://neon.tech)

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Robert Haas 2024-09-06 13:32:04 Re: Refactoring postmaster's code to cleanup after child exit
Previous Message Dean Rasheed 2024-09-06 11:58:10 Re: gamma() and lgamma() functions