From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at> |
Cc: | pgsql-hackers(at)postgresql(dot)org, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, David Steele <david(at)pgmasters(dot)net> |
Subject: | Re: Use of backup_label not noted in log |
Date: | 2023-11-18 21:49:15 |
Message-ID: | 20231118214915.4vylhux6qq6mhzqr@awork3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hi,
On 2023-11-18 10:01:42 -0800, Andres Freund wrote:
> > What about adding it to the "redo starts at" message, something like
> >
> > redo starts at 12/12345678, taken from control file
> >
> > or
> >
> > redo starts at 12/12345678, taken from backup label
>
> I think it'd make sense to log use of backup_label earlier than that - the
> locations from backup_label might end up not being available in the archive,
> the primary or locally, and we'll error out with "could not locate a valid
> checkpoint record".
>
> I'd probably just do it within the if (read_backup_label()) block in
> InitWalRecovery(), *before* the ReadCheckpointRecord().
Not enamored with the phrasing of the log messages, but here's a prototype:
When starting up with backup_label present:
LOG: starting from base backup with redo LSN A/34100028, checkpoint LSN A/34100080 on timeline ID 1
When restarting before reaching the end of the backup, but after backup_label
has been removed:
LOG: continuing to start from base backup with redo LSN A/34100028
LOG: entering standby mode
LOG: redo starts at A/3954B958
Note that the LSN in the "continuing" case is the one the backup started at,
not where recovery will start.
I've wondered whether it's worth also adding an explicit message just after
ReachedEndOfBackup(), but it seems far less urgent due to the existing
"consistent recovery state reached at %X/%X" message.
We are quite inconsistent about how we spell LSNs. Sometimes with LSN
preceding, sometimes not. Sometimes with (LSN). Etc.
> I do like the idea of expanding the "redo starts at" message
> though. E.g. including minRecoveryLSN, ControlFile->backupStartPoint,
> ControlFile->backupEndPoint would provide information about when the node
> might become consistent.
Playing around with this a bit, I'm wondering if we instead should remove that
message, and emit something more informative earlier on. If there's a problem,
you kinda want the information before we finally get to the loop in
PerformWalLRecovery(). If e.g. there's no WAL you'll only get
LOG: invalid checkpoint record
PANIC: could not locate a valid checkpoint record
which is delightfully lacking in details.
There also are some other oddities:
If the primary is down when starting up, and we'd need WAL from the primary
for the first record, the "redo start at" message is delayed until that
happens, because we emit the message not before we read the first record, but
after. That's just plain odd.
And sometimes we'll start referencing the LSN at which we are starting
recovery before the "redo starts at" message. If e.g. we shut down
at a restart point, we'll emit
LOG: consistent recovery state reached at ...
before
LOG: redo starts at ...
But that's all clearly just material for HEAD.
Greetings,
Andres Freund
Attachment | Content-Type | Size |
---|---|---|
v1-0001-wip-Log-when-starting-up-from-a-base-backup.patch | text/x-diff | 2.3 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Andres Freund | 2023-11-18 22:29:11 | PANIC serves too many masters |
Previous Message | Andres Freund | 2023-11-18 21:05:19 | Re: long-standing data loss bug in initial sync of logical replication |