From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Fujii Masao <fujii(at)postgresql(dot)org>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Weird failure with latches in curculio on v15 |
Date: | 2023-02-01 16:58:01 |
Message-ID: | 20230201165801.33ydbxvjdbomjqa7@alap3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On 2023-02-01 10:12:26 -0500, Tom Lane wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > On 2023-02-01 16:21:16 +1300, Thomas Munro wrote:
> >> It's always in proc_exit() in StartupProcShutdownHandler(), a SIGTERM
> >> handler which is allowed to call that while in_restore_command is
> >> true.
>
> > Ugh, no wonder we're getting crashes. This whole business seems bogus as
> > hell.
>
> Indeed :-(
>
> > I don't see a choice but to revert the recent changes. They need a
> > fairly large rewrite.
>
> 9a740f81e clearly made things a lot worse, but it wasn't great
> before. Can we see a way forward to removing the problem entirely?
Yea, I think we can - we should stop relying on system(). If we instead
run the command properly as a subprocess, we don't need to do bad things
in the signal handler anymore.
> The fundamental issue is that we have no good way to break out
> of system(), and I think the original idea was that
> in_restore_command would be set *only* for the duration of the
> system() call. That's clearly been lost sight of completely,
> but maybe as a stopgap we could try to get back to that.
We could push the functions setting in_restore_command down into
ExecuteRecoveryCommand(). But I don't think that'd end up necessarily
being right either - we'd now use the mechanism in places we previously
didn't (cleanup/end commands).
And there's just plenty other stuff in the 14bdb3f13de 9a740f81eb0 that
doesn't look right:
- We now have two places open-coding what BuildRestoreCommand did
- I'm doubtful that the new shell_* functions are the base for a good
API to abstract restoring files
- the error message for a failed restore command seems to have gotten
worse:
could not restore file \"%s\" from archive: %s"
->
"%s \"%s\": %s", commandName, command
- shell_* imo is not a good namespace for something called from xlog.c,
xlogarchive.c. I realize the intention is that shell_archive.c is
going to be its own "restore module", but for now it imo looks odd
- The comment moved out of RestoreArchivedFile() doesn't seems less
useful at its new location
- explanation of why we use GetOldestRestartPoint() is halfway lost
My name is listed as the first Reviewed-by, but I certainly haven't done
any meaningful review of these patches. I just replied to top-level
email proposing "recovery modules".
Greetings,
Andres Freund
From | Date | Subject | |
---|---|---|---|
Next Message | Robert Haas | 2023-02-01 17:08:24 | Re: Weird failure with latches in curculio on v15 |
Previous Message | Pavel Borisov | 2023-02-01 16:57:45 | Re: About PostgreSQL Core Team |