Quick Links

Re: Use of backup_label not noted in log

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
Cc:	pgsql-hackers(at)postgresql(dot)org, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net>
Subject:	Re: Use of backup_label not noted in log
Date:	2023-11-18 21:49:15
Message-ID:	20231118214915.4vylhux6qq6mhzqr@awork3.anarazel.de
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hi,

On 2023-11-18 10:01:42 -0800, Andres Freund wrote:
> > What about adding it to the "redo starts at" message, something like
> >
> > redo starts at 12/12345678, taken from control file
> >
> > or
> >
> > redo starts at 12/12345678, taken from backup label
>
> I think it'd make sense to log use of backup_label earlier than that - the
> locations from backup_label might end up not being available in the archive,
> the primary or locally, and we'll error out with "could not locate a valid
> checkpoint record".
>
> I'd probably just do it within the if (read_backup_label()) block in
> InitWalRecovery(), *before* the ReadCheckpointRecord().

Not enamored with the phrasing of the log messages, but here's a prototype:

When starting up with backup_label present:
LOG: starting from base backup with redo LSN A/34100028, checkpoint LSN A/34100080 on timeline ID 1

When restarting before reaching the end of the backup, but after backup_label
has been removed:
LOG: continuing to start from base backup with redo LSN A/34100028
LOG: entering standby mode
LOG: redo starts at A/3954B958

Note that the LSN in the "continuing" case is the one the backup started at,
not where recovery will start.

I've wondered whether it's worth also adding an explicit message just after
ReachedEndOfBackup(), but it seems far less urgent due to the existing
"consistent recovery state reached at %X/%X" message.

We are quite inconsistent about how we spell LSNs. Sometimes with LSN
preceding, sometimes not. Sometimes with (LSN). Etc.

> I do like the idea of expanding the "redo starts at" message
> though. E.g. including minRecoveryLSN, ControlFile->backupStartPoint,
> ControlFile->backupEndPoint would provide information about when the node
> might become consistent.

Playing around with this a bit, I'm wondering if we instead should remove that
message, and emit something more informative earlier on. If there's a problem,
you kinda want the information before we finally get to the loop in
PerformWalLRecovery(). If e.g. there's no WAL you'll only get
LOG: invalid checkpoint record
PANIC: could not locate a valid checkpoint record

which is delightfully lacking in details.

There also are some other oddities:

If the primary is down when starting up, and we'd need WAL from the primary
for the first record, the "redo start at" message is delayed until that
happens, because we emit the message not before we read the first record, but
after. That's just plain odd.

And sometimes we'll start referencing the LSN at which we are starting
recovery before the "redo starts at" message. If e.g. we shut down
at a restart point, we'll emit

LOG: consistent recovery state reached at ...
before
LOG: redo starts at ...

But that's all clearly just material for HEAD.

Greetings,

Andres Freund

Attachment	Content-Type	Size
v1-0001-wip-Log-when-starting-up-from-a-base-backup.patch	text/x-diff	2.3 KB

In response to

Re: Use of backup_label not noted in log at 2023-11-18 18:01:42 from Andres Freund

Responses

Re: Use of backup_label not noted in log at 2023-11-20 08:30:31 from Michael Paquier
Re: Use of backup_label not noted in log at 2023-11-20 13:56:47 from David Steele
Re: Use of backup_label not noted in log at 2023-11-20 19:08:15 from David Steele

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andres Freund	2023-11-18 22:29:11	PANIC serves too many masters
Previous Message	Andres Freund	2023-11-18 21:05:19	Re: long-standing data loss bug in initial sync of logical replication