Re: Weird failure with latches in curculio on v15

From: Andres Freund <andres(at)anarazel(dot)de>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(at)paquier(dot)xyz>, Robert Haas <robertmhaas(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Fujii Masao <fujii(at)postgresql(dot)org>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Weird failure with latches in curculio on v15
Date: 2023-02-04 11:20:34
Message-ID: 20230204112034.egwp52oa2f5i3crk@alap3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2023-02-03 10:54:17 -0800, Nathan Bossart wrote:
> @@ -146,7 +146,25 @@ ExecuteRecoveryCommand(const char *command, const char *commandName,
> */
> fflush(NULL);
> pgstat_report_wait_start(wait_event_info);
> +
> + /*
> + * PreRestoreCommand() is used to tell the SIGTERM handler for the startup
> + * process that it is okay to proc_exit() right away on SIGTERM. This is
> + * done for the duration of the system() call because there isn't a good
> + * way to break out while it is executing. Since we might call proc_exit()
> + * in a signal handler here, it is extremely important that nothing but the
> + * system() call happens between the calls to PreRestoreCommand() and
> + * PostRestoreCommand(). Any additional code must go before or after this
> + * section.
> + */
> + if (exitOnSigterm)
> + PreRestoreCommand();
> +
> rc = system(command);
> +
> + if (exitOnSigterm)
> + PostRestoreCommand();
> +
> pgstat_report_wait_end();
>
> if (rc != 0)

It's somewhat weird that we now call the startup-process specific
PreRestoreCommand/PostRestoreCommand() in other processes than the
startup process. Gated on a variable that's not immediately obviously
tied to being in the startup process.

I think at least we ought to add comments to PreRestoreCommand /
PostRestoreCommand that they need to be robust against being called
outside of the startup process, and similarly a comment in
ExecuteRecoveryCommand(), explaining that all this stuff just works in
the startup process.

> diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
> index bcd23542f1..503eb1a5a6 100644
> --- a/src/backend/postmaster/startup.c
> +++ b/src/backend/postmaster/startup.c
> @@ -19,6 +19,8 @@
> */
> #include "postgres.h"
>
> +#include <unistd.h>
> +
> #include "access/xlog.h"
> #include "access/xlogrecovery.h"
> #include "access/xlogutils.h"
> @@ -121,7 +123,17 @@ StartupProcShutdownHandler(SIGNAL_ARGS)
> int save_errno = errno;
>
> if (in_restore_command)
> - proc_exit(1);
> + {
> + /*
> + * If we are in a child process (e.g., forked by system() in
> + * shell_restore()), we don't want to call any exit callbacks. The
> + * parent will take care of that.
> + */
> + if (MyProcPid == (int) getpid())
> + proc_exit(1);
> + else
> + _exit(1);

I think it might be worth adding something like
const char msg[] = "StartupProcShutdownHandler() called in child process";
write(STDERR_FILENO, msg, sizeof(msg));
to this path. Otherwise it might end up being a very hard to debug path.

Greetings,

Andres Freund

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2023-02-04 11:24:10 Re: [PATCH] Use indexes on the subscriber when REPLICA IDENTITY is full on the publisher
Previous Message Amit Kapila 2023-02-04 11:04:10 Re: Exit walsender before confirming remote flush in logical replication