Re: Preventing hangups in bgworker start/stop during DB shutdown

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Preventing hangups in bgworker start/stop during DB shutdown
Date: 2020-12-24 12:49:51
Message-ID: CALj2ACWWiqC1Ci3dr4nTbpxH78z=fGO14uk6NEaT14zpbutKnA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Dec 23, 2020 at 3:10 AM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
> Here's an attempt at closing the race condition discussed in [1]
> (and in some earlier threads, though I'm too lazy to find them).
>
> The core problem is that the bgworker management APIs were designed
> without any thought for exception conditions, notably "we're not
> gonna launch any more workers because we're shutting down the database".
> A process waiting for a worker in WaitForBackgroundWorkerStartup or
> WaitForBackgroundWorkerShutdown will wait forever, so that the database
> fails to shut down without manual intervention.
>
> I'd supposed that we would need some incompatible changes in those APIs
> in order to fix this, but after further study it seems like we could
> hack things by just acting as though a request that won't be serviced
> has already run to completion. I'm not terribly satisfied with that
> as a long-term solution --- it seems to me that callers should be told
> that there was a failure. But this looks to be enough to solve the
> lockup condition for existing callers, and it seems like it'd be okay
> to backpatch.
>
> Thoughts?
>
> [1] https://www.postgresql.org/message-id/flat/16785-c0207d8c67fb5f25%40postgresql.org

1) Yeah, the postmaster will not be able to start the bg workers in
following cases, when bgworker_should_start_now returns false. So we
might encounter the hang issue.
static bool
bgworker_should_start_now(BgWorkerStartTime start_time)
{
switch (pmState)
{
case PM_NO_CHILDREN:
case PM_WAIT_DEAD_END:
case PM_SHUTDOWN_2:
case PM_SHUTDOWN:
case PM_WAIT_BACKENDS:
case PM_STOP_BACKENDS:
break;

2) What if postmaster enters pmState >= PM_STOP_BACKENDS state after
it calls BackgroundWorkerStateChange(pmState < PM_STOP_BACKENDS)?
First of all, is it possible? I think yes, because we are in
sigusr1_handler(), and I don't see we blocking the signal that sets
pmState >= PM_STOP_BACKENDS either in sigusr1_handler or in
BackgroundWorkerStateChange(). Though it's a small window, we might
get into the hangup issue? If yes, can we check the pmState in the for
loop in BackgroundWorkerStateChange()?

if (CheckPostmasterSignal(PMSIGNAL_BACKGROUND_WORKER_CHANGE))
{
- BackgroundWorkerStateChange();
+ /* Accept new dynamic worker requests only if not stopping. */
+ BackgroundWorkerStateChange(pmState < PM_STOP_BACKENDS);
StartWorkerNeeded = true;
}

3) Can we always say that if bgw_restart_time is BGW_NEVER_RESTART,
then it's a dynamic bg worker? I think we can also have normal
bgworkers with BGW_NEVER_RESTART flag(I see one in worker_spi.c
_PG_init()), will there be any problem? Or do we need some comment
tweaking?

+ /*
+ * If this is a dynamic worker request, and we aren't allowing new
+ * dynamic workers, then immediately mark it for termination; the next
+ * stanza will take care of cleaning it up.
+ */
+ if (slot->worker.bgw_restart_time == BGW_NEVER_RESTART &&
+ !allow_new_workers)
+ slot->terminate = true;

4) IIUC, in the patch we mark slot->terminate = true only for
BGW_NEVER_RESTART kind bg workers, what happens if a bg worker has
bgw_restart_time seconds and don't we hit the hanging issue(that we
are trying to solve here) for those bg workers?

With Regards,
Bharath Rupireddy.
EnterpriseDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message k.jamison@fujitsu.com 2020-12-24 13:29:53 RE: [Patch] Optimize dropping of relation buffers using dlist
Previous Message Tomas Vondra 2020-12-24 12:38:03 Re: How is this possible "publication does not exist"