Re: Requiring recovery.signal or standby.signal when recovering with a backup_label

From: Bowen Shi <zxwsbg12138(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: David Zhang <david(dot)zhang(at)highgo(dot)ca>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: Re: Requiring recovery.signal or standby.signal when recovering with a backup_label
Date: 2023-09-21 03:45:06
Message-ID: CAM_vCudkSjr7NsNKSdjwtfAm9dbzepY6beZ5DP177POKy8=2aw@mail.gmail.com
Lists: pgsql-hackers

Thanks for the patch.

I reran the test described in
https://www.postgresql.org/message-id/flat/ZQtzcH2lvo8leXEr%40paquier.xyz#cc5ed83e0edc0b9a1c1305f08ff7a335
so that we can discuss all of the problems in this thread.

First I hit the error "FATAL: could not find recovery.signal or
standby.signal when recovering with backup_label". I then deleted the
backup_label file, and the instance started successfully.

> Delete a backup_label from a fresh base backup can easily lead to data
> corruption, as the startup process would pick up as LSN to start
> recovery from the control file rather than the backup_label file.
> This would happen if a checkpoint updates the redo LSN in the control
> file while a backup happens and the control file is copied after the
> checkpoint, for instance. If one wishes to deploy a new primary from
> a base backup, recovery.signal is the way to go, making sure that the
> new primary is bumped into a new timeline once recovery finishes, on
> top of making sure that the startup process starts recovery from a
> position where the cluster would be able to achieve a consistent
> state.
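
To make the risk described above concrete, here is a minimal standalone
sketch of how the replay start point changes once backup_label is gone.
It is not the code in xlogrecovery.c; lsn_t, choose_replay_start() and
skips_backup_wal() are hypothetical names used only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for an LSN (PostgreSQL itself uses XLogRecPtr). */
typedef uint64_t lsn_t;

/*
 * Pick the LSN where WAL replay begins.  With a backup_label present,
 * the redo LSN recorded in that file is used; without it, the startup
 * process falls back to the redo LSN stored in the control file.
 */
static lsn_t
choose_replay_start(bool have_backup_label, lsn_t label_redo, lsn_t control_redo)
{
    return have_backup_label ? label_redo : control_redo;
}

/*
 * True when replay would begin after the backup's start point, meaning
 * WAL needed to make the copied files consistent is never replayed.
 */
static bool
skips_backup_wal(lsn_t replay_start, lsn_t label_redo)
{
    return replay_start > label_redo;
}
```

With a backup whose label says 100 and a control file copied after a
later checkpoint at 250, choose_replay_start(false, 100, 250) returns
250, and skips_backup_wal() reports that the WAL between the two points
would be skipped.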

ereport(FATAL,
        (errmsg("could not find redo location referenced by checkpoint record"),
         errhint("If you are restoring from a backup, touch \"%s/recovery.signal\" and add required recovery options.\n"
                 "If you are not restoring from a backup, try removing the file \"%s/backup_label\".\n"
                 "Be careful: removing \"%s/backup_label\" will result in a corrupt cluster if restoring from a backup.",
                 DataDir, DataDir, DataDir)));

There are two similar error messages in xlogrecovery.c, one of which
is shown above. Maybe we can word the new error message and its hint
consistently with them.
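
For reference, the start-up cases from the quoted message below can be
summarized in a small standalone sketch. It only models the outcomes
reported in this thread for the v1 behaviour; it is not the patch
itself, and classify_startup() and the startup_outcome values are
hypothetical names. The standby-mode checks on primary_conninfo are
left out on purpose.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the outcomes discussed in this thread. */
typedef enum
{
    START_NORMAL_OR_CRASH_RECOVERY, /* no backup_label, no signal file */
    START_ARCHIVE_RECOVERY,         /* recovery.signal with restore_command */
    START_STANDBY,                  /* standby.signal */
    START_FATAL_NO_SIGNAL,          /* backup_label without a signal file */
    START_FATAL_NO_RESTORE_COMMAND  /* recovery.signal, no restore_command */
} startup_outcome;

/*
 * Classify a start-up attempt from the files found in the data
 * directory and the restore_command setting (NULL or "" when unset).
 */
startup_outcome
classify_startup(bool backup_label, bool recovery_signal,
                 bool standby_signal, const char *restore_command)
{
    /* When both signal files exist, standby mode takes precedence. */
    if (standby_signal)
        return START_STANDBY;

    if (recovery_signal)
    {
        /* Archive recovery without restore_command is rejected. */
        if (restore_command == NULL || restore_command[0] == '\0')
            return START_FATAL_NO_RESTORE_COMMAND;
        return START_ARCHIVE_RECOVERY;
    }

    /* Behaviour added by the patch: backup_label needs a signal file. */
    if (backup_label)
        return START_FATAL_NO_SIGNAL;

    return START_NORMAL_OR_CRASH_RECOVERY;
}
```

Case 1 below corresponds to classify_startup(true, false, false, NULL),
case 2 to classify_startup(true, true, false, ""), and case 3 to
classify_startup(true, false, true, NULL).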

--
Bowen Shi

On Thu, 21 Sept 2023 at 11:01, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Wed, Jul 19, 2023 at 11:21:17AM -0700, David Zhang wrote:
> > 1) simply start server from a base backup
> >
> > FATAL: could not find recovery.signal or standby.signal when recovering
> > with backup_label
> >
> > HINT: If you are restoring from a backup, touch
> > "/media/david/disk1/pg_backup1/recovery.signal" or
> > "/media/david/disk1/pg_backup1/standby.signal" and add required recovery
> > options.
>
> Note the difference when --write-recovery-conf is specified, where a
> standby.conf is created with a primary_conninfo in
> postgresql.auto.conf. So, yes, that's expected by default with the
> patch.
>
> > 2) touch a recovery.signal file and then try to start the server, the
> > following error was encountered:
> >
> > FATAL: must specify restore_command when standby mode is not enabled
>
> Yes, that's also something expected in the scope of the v1 posted.
> The idea behind this restriction is that specifying recovery.signal is
> equivalent to asking for archive recovery, but not specifying
> restore_command is equivalent to not provide any options to be able to
> recover. See validateRecoveryParameters() and note that this
> restriction exists since the beginning of times, introduced in commit
> 66ec2db. I tend to agree that there is something to be said about
> self-contained backups taken from pg_basebackup, though, as these
> would fail if no restore_command is specified, and this restriction is
> in place before Postgres has introduced replication and easier ways to
> have base backups. As a whole, I think that there is a good argument
> in favor of removing this restriction for the case where archive
> recovery is requested if users have all their WAL in pg_wal/ to be
> able to recover up to a consistent point, keeping these GUC
> restrictions if requesting a standby (not recovery.signal, only
> standby.signal).
>
> > 3) touch a standby.signal file, then the server successfully started,
> > however, it operates in standby mode, whereas the intended behavior was for
> > it to function as a primary server.
>
> standby.signal implies that the server will start in standby mode. If
> one wants to deploy a new primary, that would imply a timeline jump at
> the end of recovery, you would need to specify recovery.signal
> instead.
>
> We need more discussions and more opinions, but the discussion has
> stalled for a few months now. In case, I am adding Thomas Munro in CC
> who has mentioned to me at PGcon that he was interested in this patch
> (this thread's problem is not directly related to the fact that the
> checkpointer now runs in crash recovery, though).
>
> For now, I am attaching a rebased v2.
> --
> Michael
