Re: Fix assertion in autovacuum worker

From: Andres Freund <andres(at)anarazel(dot)de>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: David Geier <geidav(dot)pg(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Fix assertion in autovacuum worker
Date: 2023-11-29 02:48:59
Message-ID: 20231129024859.kajlgqg33dksonbi@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2023-11-28 20:42:47 -0600, Nathan Bossart wrote:
> On Tue, Nov 28, 2023 at 04:03:49PM -0800, Andres Freund wrote:
> > On 2023-11-28 16:05:16 -0600, Nathan Bossart wrote:
> >> From a glance, it looks to me like the problem is that pgstat_shutdown_hook
> >> is registered as a before_shmem_exit callback, while ProcKill is registered
> >> as an on_shmem_exit callback.
> >
> > That's required, as pgstat_shutdown_hook() needs to acquire lwlocks, which you
> > can't after ProcKill(). It's also not unique to pgstat, several other
> > before_shmem_exit() callbacks acquire lwlocks (e.g. AtProcExit_Twophase(),
> > ParallelWorkerShutdown(), do_pg_abort_backup(), the first three uses of
> > before_shmem_exit when git grepping) - which makes sense, they are presumably
> > before_shmem_exit() because they need to manage shared state, which often
> > needs locks.
> >
> > In normal backends this is fine-ish, because ShutdownPostgres() is registered
> > very late (and thus is called early in the shutdown sequence), and the
> > AbortOutOfAnyTransaction() contained therein indirectly calls
> > LWLockReleaseAll() and very little happens outside of the transaction context.
>
> Right. Perhaps we could add a LWLockReleaseAll() to
> pgstat_shutdown_hook() instead of the autovacuum code, but I'm afraid that
> is still just a hack.

Yea, we'd need that in just about all before_shmem_exit() callbacks. I could
see an argument for doing it in proc_exit_prepare(). While that'd be a fairly
gross layering violation, we already do reset a number a bunch of stuff in
there:
/*
* Forget any pending cancel or die requests; we're doing our best to
* close up shop already. Note that the signal handlers will not set
* these flags again, now that proc_exit_inprogress is set.
*/
InterruptPending = false;
ProcDiePending = false;
QueryCancelPending = false;
InterruptHoldoffCount = 1;
CritSectionCount = 0;

> >> I would expect your patch to fix this particular issue, but I'm wondering
> >> whether there's a bigger problem here.
> >
> > Yes, there is - our subsystem initialization, shutdown, error recovery
> > infrastructure is a mess. We've interwoven transaction handling far too
> > tightly with error handling, the order of subystem initialization is basically
> > random and differs between operating systems (due to EXEC_BACKEND) and "mode"
> > of execution (the order differs when using single user mode) and we've
> > distributed error recovery into ~10 places (all the sigsetjmp()s in backend
> > code, xact.c and and a few other places like WalSndErrorCleanup()).
>
> :(
>
> I do remember looking into uniting all the various sigsetjmp() calls
> before. That could be worth another try. The rest will probably require
> additional thought...

It'd definitely be worth some effort. I'm quite sure that we have a number of
hard to find bugs around this.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2023-11-29 02:58:46 Re: Synchronizing slots from primary to standby
Previous Message Nathan Bossart 2023-11-29 02:42:47 Re: Fix assertion in autovacuum worker