Re: Refactoring postmaster's code to cleanup after child exit

From: Andres Freund <andres(at)anarazel(dot)de>
To: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Refactoring postmaster's code to cleanup after child exit
Date: 2024-09-10 16:59:03
Message-ID: nkso6sfhz2zcn4rdwue6ozadgjpptrcwqnoqv5pd2oqfg6gag3@55nqt46wtwjx
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2024-09-06 16:13:43 +0300, Heikki Linnakangas wrote:
> On 04/09/2024 17:35, Andres Freund wrote:
> > On 2024-08-12 12:55:00 +0300, Heikki Linnakangas wrote:
> > > From dc53f89edbeec99f8633def8aa5f47cd98e7a150 Mon Sep 17 00:00:00 2001
> > > From: Heikki Linnakangas <heikki(dot)linnakangas(at)iki(dot)fi>
> > > Date: Mon, 12 Aug 2024 10:59:04 +0300
> > > Subject: [PATCH v4 4/8] Introduce a separate BackendType for dead-end children
> > >
> > > And replace postmaster.c's own "backend type" codes with BackendType
> >
> > Hm - it seems a bit odd to open-code this when we actually have a "table
> > driven configuration" available? Why isn't the type a field in
> > child_process_kind?
>
> Sorry, I didn't understand this. What exactly would you add to
> child_process_kind? Where would you use it?

I'm not entirely sure what I was thinking of. It might be partially triggering
a prior complaint I had about manually assigning things to MyBackendType,
despite actually having all the information already.

One thing that I just noticed is that this patch orphans comment references to
BACKEND_TYPE_AUTOVAC and BACKEND_TYPE_BGWORKER.

Seems a tad odd to have BACKEND_TYPE_ALL after removing everything else from
the BACKEND_TYPE_* "namespace".

To deal with the issue around bitmasks you had mentioned, I think we should at
least have a static inline function to convert B_* values to the bitmask
index.

> > > +/*
> > > + * MaxLivePostmasterChildren
> > > + *
> > > + * This reports the number postmaster child processes that can be active. It
> > > + * includes all children except for dead_end children. This allows the array
> > > + * in shared memory (PMChildFlags) to have a fixed maximum size.
> > > + */
> > > +int
> > > +MaxLivePostmasterChildren(void)
> > > +{
> > > + int n = 0;
> > > +
> > > + /* We know exactly how mamy worker and aux processes can be active */
> > > + n += autovacuum_max_workers;
> > > + n += max_worker_processes;
> > > + n += NUM_AUXILIARY_PROCS;
> > > +
> > > + /*
> > > + * We allow more connections here than we can have backends because some
> > > + * might still be authenticating; they might fail auth, or some existing
> > > + * backend might exit before the auth cycle is completed. The exact
> > > + * MaxBackends limit is enforced when a new backend tries to join the
> > > + * shared-inval backend array.
> > > + */
> > > + n += 2 * (MaxConnections + max_wal_senders);
> > > +
> > > + return n;
> > > +}
> >
> > I wonder if we could instead maintain at least some of this in
> > child_process_kinds? Manually listing different types of processes in
> > different places doesn't seem particularly sustainable.
>
> Hmm, you mean adding "max this kind of children" field to
> child_process_kinds? Perhaps.

Yep, that's what I meant.

> > > +/* Return the appropriate free-list for the given backend type */
> > > +static dlist_head *
> > > +GetFreeList(BackendType btype)
> > > +{
> > > + switch (btype)
> > > + {
> > > + case B_BACKEND:
> > > + case B_BG_WORKER:
> > > + case B_WAL_SENDER:
> > > + case B_SLOTSYNC_WORKER:
> > > + return &freeBackendList;
> >
> > Maybe a daft question - but why are all of these in the same list? Sure,
> > they're essentially all full backends, but they're covered by different GUCs?
>
> No reason. No particular reason they should *not* share the same list either
> though.

Aren't they controlled by distinct connection limits? Isn't there a danger
that we could use up entries and fail connections due to that, despite not
actually being above the limit?

> > Tangential: Why do we need a freelist for these and why do we choose a random
> > pgproc for these instead of assigning one statically?
> >
> > Background: I'd like to not provide AIO workers with "bounce buffers" (for IO
> > of buffers that can't be done in-place, like writes when checksums are
> > enabled). The varying proc numbers make that harder than it'd have to be...
>
> Yeah, we can make these fixed.

Cool.

> Currently, the # of slots reserved for aux processes is sized by
> NUM_AUXILIARY_PROCS, which is one smaller than the number of different aux
> proces kinds:

> > /*
> > * We set aside some extra PGPROC structures for auxiliary processes,
> > * ie things that aren't full-fledged backends but need shmem access.
> > *
> > * Background writer, checkpointer, WAL writer, WAL summarizer, and archiver
> > * run during normal operation. Startup process and WAL receiver also consume
> > * 2 slots, but WAL writer is launched only after startup has exited, so we
> > * only need 6 slots.
> > */
> > #define NUM_AUXILIARY_PROCS 6
>
> For PMChildSlot numbers, we could certainly just allocate one more slot.
>
> It would probably make sense for PGPROCs too, even though PGPROC is a much
> larger struct.

I don't think it's worth worrying about that much. PGPROC is large, but not
*that* large. And the robustness win of properly detecting when there's a
problem around starting/stopping aux workers seems to outweigh that to me.

> > > +PMChild *
> > > +AssignPostmasterChildSlot(BackendType btype)
> > > +{
> > > + dlist_head *freelist;
> > > + PMChild *pmchild;
> > > +
> > > + freelist = GetFreeList(btype);
> > > +
> > > + if (dlist_is_empty(freelist))
> > > + return NULL;
> > > +
> > > + pmchild = dlist_container(PMChild, elem, dlist_pop_head_node(freelist));
> > > + pmchild->pid = 0;
> > > + pmchild->bkend_type = btype;
> > > + pmchild->rw = NULL;
> > > + pmchild->bgworker_notify = true;
> > > +
> > > + /*
> > > + * pmchild->child_slot for each entry was initialized when the array of
> > > + * slots was allocated.
> > > + */
> > > +
> > > + dlist_push_head(&ActiveChildList, &pmchild->elem);
> > > +
> > > + ReservePostmasterChildSlot(pmchild->child_slot);
> > > +
> > > + /* FIXME: find a more elegant way to pass this */
> > > + MyPMChildSlot = pmchild->child_slot;
> >
> > What if we assigned one offset for each process and assigned its ID here and
> > also used that for its ProcNumber - that way we wouldn't need to manage
> > freelists in two places.
>
> It's currently possible to have up to 2 * max_connections backends in the
> authentication phase. We would have to change that behaviour, or make the
> PGPROC array 2x larger.

That however, might be too much...

> It might well be worth it, I don't know how sensible the current behaviour
> is. But I'd like to punt that to later patch, to keep the scope of this
> patch set reasonable. It's pretty straightforward to do later on top of this
> if we want to.

Makes sense.

I still think that we'd be better off to just return an error to the client in
postmaster, rather than deal with this dead-end children mess. That was
perhaps justified at some point, but now it seems to add way more complexity
than it's worth. And it's absurdly expensive to fork to return an error. Way
more expensive than just having postmaster send an error and close the socket.

> > > @@ -2469,11 +2410,15 @@ process_pm_child_exit(void)
> > > }
> > > /* Was it the system logger? If so, try to start a new one */
> > > - if (pid == SysLoggerPID)
> > > + if (SysLoggerPMChild && pid == SysLoggerPMChild->pid)
> > > {
> > > - SysLoggerPID = 0;
> > > /* for safety's sake, launch new logger *first* */
> > > - SysLoggerPID = SysLogger_Start();
> > > + SysLoggerPMChild->pid = SysLogger_Start();
> > > + if (SysLoggerPMChild->pid == 0)
> > > + {
> > > + FreePostmasterChildSlot(SysLoggerPMChild);
> > > + SysLoggerPMChild = NULL;
> > > + }
> > > if (!EXIT_STATUS_0(exitstatus))
> > > LogChildExit(LOG, _("system logger process"),
> >
> > Seems a bit weird to have one place with a different memory lifetime handling
> > than other places. Why don't we just do this the same way as in other places
> > but continue to defer the logging until after we tried to start the new
> > logger?
>
> Hmm, you mean let LaunchMissingBackgroundProcesses() handle the restart?

Yea - which it already can do, presumably to handle the case of
logging_collector. It just seems odd to have code to have three places calling
SysLogger_Start() - with some mild variations of the code.

Perhaps we can at least centralize some of that?

But you have a point with:

> I'm a little scared of changing the existing logic. We don't have a
> mechanism for deferring logging, so we would have to invent that, or the
> logs would just accumulate in the pipe until syslogger starts up. There's
> some code between here and LaunchMissingBackgroundProcesses(), so might
> postmaster get blocked between writing to the syslogger pipe, before having
> restarted it?
>
> If forking the syslogger process fails, that can happen anyway, though.

> > > /* Construct a process name for log message */
> > > +
> > > + /*
> > > + * FIXME: use GetBackendTypeDesc here? How does the localization of that
> > > + * work?
> > > + */
> > > if (bp->bkend_type == B_DEAD_END_BACKEND)
> > > {
> > > procname = _("dead end backend");
> >
> > Might be worth having a version of GetBackendTypeDesc() that returns a
> > translated string?
>
> Constructing the string for background workers is a little more complicated:

Random aside: I *hate* that there's no trivial way recognie background workers
in pg_stat_activity, because somebody made pg_stat_activity.backend_type
report something completely under control of extensions...

> > > @@ -2697,9 +2643,16 @@ HandleChildCrash(int pid, int exitstatus, const char *procname)
> > > {
> > > dlist_iter iter;
> > > - dlist_foreach(iter, &BackendList)
> > > + dlist_foreach(iter, &ActiveChildList)
> > > {
> > > - Backend *bp = dlist_container(Backend, elem, iter.cur);
> > > + PMChild *bp = dlist_container(PMChild, elem, iter.cur);
> > > +
> > > + /* We do NOT restart the syslogger */
> > > + if (bp == SysLoggerPMChild)
> > > + continue;
> >
> > That comment seems a bit misleading - we do restart syslogger, we just don't
> > do it here, no? I realize it's an old comment, but it still seems like it's
> > worth fixing given that you touch all the code here...
>
> No, we really do not restart the syslogger.

Hm?

/* Was it the system logger? If so, try to start a new one */
if (SysLoggerPMChild && pid == SysLoggerPMChild->pid)
{
/* for safety's sake, launch new logger *first* */
SysLoggerPMChild->pid = SysLogger_Start(SysLoggerPMChild->child_slot);
if (SysLoggerPMChild->pid == 0)
{
FreePostmasterChildSlot(SysLoggerPMChild);
SysLoggerPMChild = NULL;
}
if (!EXIT_STATUS_0(exitstatus))
LogChildExit(LOG, _("system logger process"),
pid, exitstatus);
continue;
}

We don't do it reaction to other processes crashing, but we still restart it
if it dies. Perhaps it's clear from context - but I had to think aobut it for
a moment.

>
> > > @@ -2871,29 +2786,27 @@ PostmasterStateMachine(void)
> > > <snip>
> > > - if (WalSummarizerPID != 0)
> > > - signal_child(WalSummarizerPID, SIGTERM);
> > > - if (SlotSyncWorkerPID != 0)
> > > - signal_child(SlotSyncWorkerPID, SIGTERM);
> > > + targetMask |= (1 << B_STARTUP);
> > > + targetMask |= (1 << B_WAL_RECEIVER);
> > > +
> > > + targetMask |= (1 << B_WAL_SUMMARIZER);
> > > + targetMask |= (1 << B_SLOTSYNC_WORKER);
> > > /* checkpointer, archiver, stats, and syslogger may continue for now */
> > > + SignalSomeChildren(SIGTERM, targetMask);
> > > +
> > > /* Now transition to PM_WAIT_BACKENDS state to wait for them to die */
> > > pmState = PM_WAIT_BACKENDS;
> > > <snip>
> >
> > It's likely the right thing to not do as one patch, but IMO this really wants
> > to be a state table. Perhaps as part of child_process_kinds, perhaps separate
> > from that.
>
> Yeah. I've tried to refactor this into a table before, but didn't come up
> with anything that I was happy with. I also feel there must be a better way
> to organize this, but not sure what exactly. I hope that will become more
> apparent after these other changes.

What I'm imagining is something like:
1) Make PMState values each have a distinct bit
2) Move PMState to some (new?) header
3) Add a "uint32 should_run" member to child_process_kind that's a bitmask of
PMStates
4) Add a new function in launch_backend.c that gets passed the "target"
PMState and returns a bitmask of the tasks that should be running (or the
inverse, doesn't really matter).
5) Instead of open-coding the targetMask "computation", use the new function
from 4).

I think that might not look too bad?

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mats Kindahl 2024-09-10 17:07:18 Re: Use streaming read API in ANALYZE
Previous Message Maxim Orlov 2024-09-10 16:53:02 Re: Latches vs lwlock contention