From: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | David Geier <geidav(dot)pg(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Fix assertion in autovacuum worker |
Date: | 2023-11-29 02:42:47 |
Message-ID: | 20231129024247.GA479372@nathanxps13 |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, Nov 28, 2023 at 04:03:49PM -0800, Andres Freund wrote:
> On 2023-11-28 16:05:16 -0600, Nathan Bossart wrote:
>> From a glance, it looks to me like the problem is that pgstat_shutdown_hook
>> is registered as a before_shmem_exit callback, while ProcKill is registered
>> as an on_shmem_exit callback.
>
> That's required, as pgstat_shutdown_hook() needs to acquire lwlocks, which you
> can't after ProcKill(). It's also not unique to pgstat, several other
> before_shmem_exit() callbacks acquire lwlocks (e.g. AtProcExit_Twophase(),
> ParallelWorkerShutdown(), do_pg_abort_backup(), the first three uses of
> before_shmem_exit when git grepping) - which makes sense, they are presumably
> before_shmem_exit() because they need to manage shared state, which often
> needs locks.
>
> In normal backends this is fine-ish, because ShutdownPostgres() is registered
> very late (and thus is called early in the shutdown sequence), and the
> AbortOutOfAnyTransaction() contained therein indirectly calls
> LWLockReleaseAll() and very little happens outside of the transaction context.
Right. Perhaps we could add a LWLockReleaseAll() to
pgstat_shutdown_hook() instead of the autovacuum code, but I'm afraid that
is still just a hack.
>> I would expect your patch to fix this particular issue, but I'm wondering
>> whether there's a bigger problem here.
>
> Yes, there is - our subsystem initialization, shutdown, error recovery
> infrastructure is a mess. We've interwoven transaction handling far too
> tightly with error handling, the order of subystem initialization is basically
> random and differs between operating systems (due to EXEC_BACKEND) and "mode"
> of execution (the order differs when using single user mode) and we've
> distributed error recovery into ~10 places (all the sigsetjmp()s in backend
> code, xact.c and and a few other places like WalSndErrorCleanup()).
:(
I do remember looking into uniting all the various sigsetjmp() calls
before. That could be worth another try. The rest will probably require
additional thought...
--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com
From | Date | Subject | |
---|---|---|---|
Next Message | Andres Freund | 2023-11-29 02:48:59 | Re: Fix assertion in autovacuum worker |
Previous Message | Andres Freund | 2023-11-29 02:38:56 | Re: common signal handler protection |