From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Fujii Masao <fujii(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com> |
Cc: | Michael Paquier <michael(at)paquier(dot)xyz>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Weird failure with latches in curculio on v15 |
Date: | 2023-02-01 10:55:14 |
Message-ID: | 20230201105514.rsjl4bnhb65giyvo@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On 2023-02-01 16:21:16 +1300, Thomas Munro wrote:
> My database off assertion failures has four like that, all 15 and master:
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2023-02-01%2001:05:17
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2023-01-11%2011:16:40
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=curculio&dt=2022-11-22%2012:19:21
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=conchuela&dt=2022-11-17%2021:47:02
>
> It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM
> handler which is allowed to call that while in_restore_command is
> true.
Ugh, no wonder we're getting crashes. This whole business seems bogus as
hell.
RestoreArchivedFile():
...
/*
* Check signals before restore command and reset afterwards.
*/
PreRestoreCommand();
/*
* Copy xlog from archival storage to XLOGDIR
*/
ret = shell_restore(xlogfname, xlogpath, lastRestartPointFname);
PostRestoreCommand();
/* SIGTERM: set flag to abort redo and exit */
static void
StartupProcShutdownHandler(SIGNAL_ARGS)
{
int save_errno = errno;
if (in_restore_command)
proc_exit(1);
else
shutdown_requested = true;
WakeupRecovery();
errno = save_errno;
}
Where PreRestoreCommand()/PostRestoreCommand() set in_restore_command.
There's *a lot* of stuff happening inside shell_restore() that's not
compatible with doing proc_exit() inside a signal handler. We're
allocating memory! Interact with stdout.
There's also the fact that system() isn't signal safe, but that's a much
less likely problematic issue.
This appears to have gotten worse over a sequence of commits. The
following commits each added something betwen PreRestoreCommand() and
PostRestoreCommand().
commit 1b06d7bac901e5fd20bba597188bae2882bf954b
Author: Fujii Masao <fujii(at)postgresql(dot)org>
Date: 2021-11-22 10:28:21 +0900
Report wait events for local shell commands like archive_command.
added pgstat_report_wait_start/end. Unlikely to cause big issues, but
not good.
commit 7fed801135bae14d63b11ee4a10f6083767046d8
Author: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Date: 2022-08-29 13:55:38 -0400
Clean up inconsistent use of fflush().
Made it a bit worse by adding an fflush(). That certainly seems like it
could cause hangs.
commit 9a740f81eb02e04179d78f3df2ce671276c27b07
Author: Michael Paquier <michael(at)paquier(dot)xyz>
Date: 2023-01-16 16:31:43 +0900
Refactor code in charge of running shell-based recovery commands
which completely broke the mechanism. We suddenly run the entirety of
shell_restore(), which does pallocs etc to build the string passed to
system, and raises errors, all within a section in which a signal
handler can invoke proc_exit(). That's just completely broken.
Sorry, but particularly in this area, you got to be a heck of a lot more
careful.
I don't see a choice but to revert the recent changes. They need a
fairly large rewrite.
Greetings,
Andres Freund
From | Date | Subject | |
---|---|---|---|
Next Message | David Rowley | 2023-02-01 11:21:20 | Re: heapgettup refactoring |
Previous Message | Dagfinn Ilmari Mannsåker | 2023-02-01 10:50:42 | Re: Clarify deleting comments and security labels in synopsis |