Re: Make mesage at end-of-recovery less scary.

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: bossartn(at)amazon(dot)com, david(at)pgmasters(dot)net, peter(dot)eisentraut(at)2ndquadrant(dot)com, andres(at)anarazel(dot)de, pgsql-hackers(at)lists(dot)postgresql(dot)org, jtc331(at)gmail(dot)com, robertmhaas(at)gmail(dot)com
Subject: Re: Make mesage at end-of-recovery less scary.
Date: 2021-11-09 00:53:15
Message-ID: YYnGe0tROh+1SqIc@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Nov 08, 2021 at 02:59:46PM +0900, Kyotaro Horiguchi wrote:
> While checking the behavior for the case of missing-contrecord, I
> noticed that emode_for_corrupt_record() doesn't work as expected since
> readSource is reset to XLOG_FROM_ANY after a read failure. We could
> remember the last failed source but pg_wal should have been visited if
> page read error happened so I changed the function so that it treats
> XLOG_FROM_ANY the same way with XLOG_FROM_PG_WAL.

FWIW, I am not much a fan of assuming that it is fine to use
XLOG_FROM_ANY as a condition here. The comments on top of
emode_for_corrupt_record() make it rather clear what the expectations
are, and this is the default readSource.

> (Otherwise we see "LOG: reached end-of-WAL at .." message after
> "WARNING: missing contrecord at.." message.)

+ /* broken record found */
+ ereport(WARNING,
+ (errmsg("redo is skipped"),
+ errhint("This suggests WAL file corruption. You might need to check the database.")));
This looks rather scary to me, FWIW, and this could easily be reached
if one forgets about EndOfWAL while hacking on xlogreader.c.
Unlikely so, still.

+ report_invalid_record(state,
+ "missing contrecord at %X/%X",
+ LSN_FORMAT_ARGS(RecPtr));
Isn't there a risk here to break applications checking after error
messages stored in the WAL reader after seeing a contrecord?

+ if (record->xl_tot_len == 0)
+ {
+ /* This is strictly not an invalid state, so phrase it as so. */
+ report_invalid_record(state,
+ "record length is 0 at %X/%X",
+ LSN_FORMAT_ARGS(RecPtr));
+ state->EndOfWAL = true;
+ return false;
+ }
This assumes that a value of 0 for xl_tot_len is a synonym of the end
of WAL, but cannot we have also a corrupted record in this case in the
shape of xl_tot_len being 0? We validate the full record after
reading the header, so it seems to me that we should not assume that
things are just ending as proposed in this patch.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2021-11-09 01:47:11 Improve error context after some failed XLogReadRecord()
Previous Message Jeff Davis 2021-11-09 00:13:17 Re: Predefined role pg_maintenance for VACUUM, ANALYZE, CHECKPOINT.