Re: FSM Corruption (was: Could not read block at end of the relation)

From: Noah Misch <noah(at)leadboat(dot)com>
To: Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>
Cc: pgsql-bugs <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: FSM Corruption (was: Could not read block at end of the relation)
Date: 2024-03-04 19:03:12
Message-ID: 20240304190312.b6.nmisch@google.com
Lists: pgsql-bugs

On Mon, Mar 04, 2024 at 02:10:39PM +0100, Ronan Dunklau wrote:
> On Monday, 4 March 2024, 00:47:15 CET, Noah Misch wrote:
> > On Tue, Feb 27, 2024 at 11:34:14AM +0100, Ronan Dunklau wrote:
> > > - happens during heavy system load
> > > - lots of concurrent writes happening on a table
> > > - often (but haven't been able to confirm it is necessary), a vacuum is
> > > running on the table at the same time the error is triggered

> Looking at when the corruption was WAL-logged, this particular case is quite
> easy to trace. We have a few MULTI-INSERTS+INIT initially loading the table
> (probably a pg_restore), then, 2GB of WAL later, what looks like a VACUUM
> running on the table: a succession of FPI_FOR_HINT, FREEZE_PAGE, VISIBLE xlog
> records for each of the relation main fork, followed by a lonely FPI for the
> leaf page of its FSM:

You're using data_checksums, right? Thanks for the wal dump excerpts; I agree
with this summary thereof.

> There are no traces of relation truncation happening in the WAL.

That is notable.

> This case only shows a single invalid entry in the FSM, but I've noticed as
> many as 62 blocks present in the FSM while they do not exist on disk, all
> tagged with MaxFSMRequestSize so I suppose something is wrong with the bulk
> extension mechanism.

Is this happening after an OS crash, a replica promote, or a PITR restore? If
so, I think I see the problem. We have an undocumented rule that FSM shall
not contain references to pages past the end of the relation. To facilitate
that, relation truncation WAL-logs FSM truncate. However, there's no similar
protection for relation extension, which is not WAL-logged. We break the rule
whenever we write FSM for block X before some WAL record initializes block X.
data_checksums makes the trouble easier to hit, since it creates FPI_FOR_HINT
records for FSM changes. A replica promote or PITR ending just after the FSM
FPI_FOR_HINT would yield this broken state. While v16 RelationAddBlocks()
made this easier to hit, I suspect it's reproducible in all supported
branches. For example, lazy_scan_new_or_empty() and multiple index AMs break
the rule via RecordPageWithFreeSpace() on a PageIsNew() page.
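
To make the sequence concrete, here is a minimal toy model in Python of the
failure described above. It is only a sketch under stated assumptions: the
ToyRelation class, the "INIT" and "FSM_FPI" record names, and the
TOY_MAX_REQUEST constant are all hypothetical and are not PostgreSQL APIs or
WAL record types. It shows only the ordering problem: FSM content reaches WAL
while the extension of the main fork does not.

```python
# Toy model only: none of these names are PostgreSQL APIs.
TOY_MAX_REQUEST = 8160  # stand-in for MaxFSMRequestSize; the value is illustrative


class ToyRelation:
    def __init__(self):
        self.nblocks = 0   # length of the main fork
        self.fsm = {}      # block number -> free space recorded in the FSM
        self.wal = []      # records that survive a crash, promote, or PITR

    def extend(self, count):
        """Extend the main fork; as in the description above, no WAL is written."""
        first = self.nblocks
        self.nblocks += count
        return list(range(first, self.nblocks))

    def wal_init_page(self, blkno):
        """Model some WAL record that initializes block blkno."""
        self.wal.append(("INIT", blkno))

    def record_free_space(self, blkno, avail):
        """Update the FSM and log an image of it (the data_checksums case)."""
        self.fsm[blkno] = avail
        self.wal.append(("FSM_FPI", dict(self.fsm)))


def replay(wal):
    """Rebuild a relation from WAL alone, as a promoted replica would."""
    rel = ToyRelation()
    for kind, payload in wal:
        if kind == "INIT":
            rel.nblocks = max(rel.nblocks, payload + 1)
        elif kind == "FSM_FPI":
            rel.fsm = dict(payload)
    return rel


def dangling_fsm_entries(rel):
    """Return FSM entries that point past the end of the main fork."""
    return sorted(b for b, avail in rel.fsm.items()
                  if b >= rel.nblocks and avail > 0)


primary = ToyRelation()
blocks = primary.extend(4)          # bulk extension, not WAL-logged
primary.wal_init_page(blocks[0])    # only block 0 has been initialized in WAL
for b in blocks[1:]:
    # FSM is written for blocks 1..3 before any WAL record initializes them.
    primary.record_free_space(b, TOY_MAX_REQUEST)

promoted = replay(primary.wal)      # recovery ends right after the FSM image
print(dangling_fsm_entries(primary))   # []
print(dangling_fsm_entries(promoted))  # [1, 2, 3]
```

In this model the primary never sees a dangling entry, because its main fork
really is four blocks long. Only the node rebuilt from WAL does, which matches
the question about a crash, promote, or PITR.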

I think the fix is one of:

- Revoke the undocumented rule. Make FSM consumers resilient to the FSM
returning a now-too-large block number.

- Enforce a new "main-fork WAL before FSM" rule for logged rels. For example,
in each PageIsNew() case, either don't update FSM or WAL-log an init (like
lazy_scan_new_or_empty() does when PageIsEmpty()).
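
The two options can be sketched in the same toy style. Again, this is a
hedged illustration and not a proposed patch: find_block_with_space() and
record_new_page() are hypothetical names, the FSM is modeled as a plain dict,
and the "INIT" and "FSM_FPI" tuples stand in for whatever WAL records a real
fix would use.

```python
def find_block_with_space(fsm, nblocks, needed):
    """Option 1: distrust the FSM. Skip, and zero out, entries past the end."""
    for blkno in sorted(fsm):
        if fsm[blkno] < needed:
            continue
        if blkno >= nblocks:
            fsm[blkno] = 0   # forget the stale entry so it is not returned again
            continue
        return blkno
    return None              # caller would extend the relation instead


def record_new_page(fsm, wal, blkno, avail, log_init):
    """Option 2: for a new page, WAL-log an init first or leave the FSM alone."""
    if not log_init:
        return False                      # choose not to update the FSM at all
    wal.append(("INIT", blkno))           # main-fork WAL comes first ...
    fsm[blkno] = avail
    wal.append(("FSM_FPI", dict(fsm)))    # ... then the FSM change
    return True


# Option 1: the relation has one block, but the FSM advertises blocks 1 and 2.
fsm = {0: 100, 1: 8160, 2: 8160}
print(find_block_with_space(fsm, 1, 4000))  # None
print(fsm)                                  # {0: 100, 1: 0, 2: 0}

# Option 2: the init record always precedes the FSM image in WAL.
wal = []
fsm2 = {}
record_new_page(fsm2, wal, 5, 8160, log_init=True)
print([kind for kind, _ in wal])            # ['INIT', 'FSM_FPI']
```

The first option pays a length check on each lookup; the second costs an
extra WAL record (or a less complete FSM) for every new page.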
