Re: FSM Corruption (was: Could not read block at end of the relation)

From: Noah Misch <noah(at)leadboat(dot)com>
To: Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>
Cc: pgsql-bugs <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: FSM Corruption (was: Could not read block at end of the relation)
Date: 2024-03-04 23:05:03
Message-ID: 20240304230503.a4.nmisch@google.com
Lists: pgsql-bugs

On Mon, Mar 04, 2024 at 11:21:07PM +0100, Ronan Dunklau wrote:
> On Monday, March 4, 2024, at 20:03:12 CET, Noah Misch wrote:
> > Is this happening after an OS crash, a replica promote, or a PITR restore?
>
> I need to double-check all occurrences, but I wouldn't be surprised if they all
> went through a replica promotion.
>
> > If so, I think I see the problem. We have an undocumented rule that FSM
> > shall not contain references to pages past the end of the relation. To
> > facilitate that, relation truncation WAL-logs FSM truncate. However,
> > there's no similar protection for relation extension, which is not
> > WAL-logged. We break the rule whenever we write FSM for block X before
> > some WAL record initializes block X. data_checksums makes the trouble
> > easier to hit, since it creates FPI_FOR_HINT records for FSM changes. A
> > replica promote or PITR ending just after the FSM FPI_FOR_HINT would yield
> > this broken state. While v16 RelationAddBlocks() made this easier to hit,
> > I suspect it's reproducible in all supported branches. For example,
> > lazy_scan_new_or_empty() and multiple index AMs break the rule via
> > RecordPageWithFreeSpace() on a PageIsNew() page.
>
> Very interesting. I understand the reasoning, and am now able to craft a very
> crude test case, which I will refine into something more actionable:
>
> - start a session 1, which will just vacuum our table continuously
> - start a session 2, which will issue checkpoints continuously
> - in a session 3, COPY enough data so that we preallocate too many pages. In
> my example, I'm copying 33150 single-integer rows, which fit on a bit more
> than 162 pages. The idea behind that is to make sure we don't fall on a round
> number of pages, and that several iterations of the extend mechanism run,
> issuing a full-page write of the FSM.
>
> So to trigger the bug, it seems to me one needs to run in parallel:
> - COPY
> - VACUUM
> - CHECKPOINT
>
> Which would explain why a surprisingly large number of occurrences I've seen
> seemed to involve a pg_restore invocation.

Does that procedure alone reproduce the ERROR, without any replica promote,
postmaster restart, etc.? That, I do not have reason to expect.
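
For reference, the three concurrent sessions from the quoted recipe can be sketched as below. The database and table names are assumptions; only the 33150 single-integer-row COPY size comes from the report, and the script degrades to a no-op when psql is not available:

```shell
#!/bin/sh
# Sketch of the quoted reproduction recipe. "postgres"/"fsm_repro" are
# placeholder names; only the 33150-row COPY size comes from the report.
DB=postgres
TBL=fsm_repro

if command -v psql >/dev/null 2>&1; then   # skip when no client is available
    psql -d "$DB" -c "CREATE TABLE IF NOT EXISTS $TBL (i int)"

    # session 1: vacuum the table continuously
    ( while :; do psql -qd "$DB" -c "VACUUM $TBL"; done ) &
    vac=$!
    # session 2: issue checkpoints continuously
    ( while :; do psql -qd "$DB" -c "CHECKPOINT"; done ) &
    ckpt=$!

    # session 3: COPY 33150 single-integer rows (a bit over 162 pages), so
    # the table does not land on a round page count and relation extension
    # runs several times while FSM full-page writes can be emitted
    seq 1 33150 | psql -d "$DB" -c "COPY $TBL FROM STDIN"

    kill "$vac" "$ckpt" 2>/dev/null || true
fi
```

Per the discussion above, this alone leaves the FSM ahead of the WAL; surfacing the ERROR would still require ending recovery (promote or PITR) just after the FSM full-page write.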

> > I think the fix is one of:
> >
> > - Revoke the undocumented rule. Make FSM consumers resilient to the FSM
> > returning a now-too-large block number.
>
> Performance-wise, that seems to be the better answer

I would guess this one is more risky from a performance perspective, since
we'd be adding to a hotter path under RelationGetBufferForTuple(). Still,
it's likely fine.

> but won't this kind of
> built-in resilience encourage subtle bugs creeping in?

Since the !wal_log_hints case avoids WAL for FSM, we've already accepted
subtle bugs. I bet an unlucky torn FSM page could reach the ERROR you
reached, even if we introduced a "main-fork WAL before FSM" rule. We'll have
plenty of need for resilience. Hence, I'd lean toward this approach.
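
As a diagnostic aid, the broken state described above (FSM advertising free space at or past the end of the relation) can be inspected with the contrib pg_freespacemap extension; "mytable" and the database name below are placeholders:

```shell
#!/bin/sh
# List FSM entries at or past the current relation end; any returned row
# matches the broken state discussed above. "mytable" is a placeholder.
if command -v psql >/dev/null 2>&1; then   # skip when no client is available
    psql -d postgres <<'SQL'
CREATE EXTENSION IF NOT EXISTS pg_freespacemap;
SELECT blkno, avail
FROM pg_freespace('mytable')
WHERE avail > 0
  AND blkno >= pg_relation_size('mytable')
             / current_setting('block_size')::bigint;
SQL
fi
```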

> > - Enforce a new "main-fork WAL before FSM" rule for logged rels. For
> > example, in each PageIsNew() case, either don't update FSM or WAL-log an
> > init (like lazy_scan_new_or_empty() does when PageIsEmpty()).
>
> The FSM is not updated by the caller: in the bulk insert case, we
> intentionally don't add the new pages to the FSM. So having VACUUM WAL-log the
> page in lazy_scan_new_or_empty in the PageIsNew case as you propose seems a
> very good idea.
>
> Performance-wise, that would keep the WAL logging outside of the backend
> performing the preventive work for itself or others, and there should not be
> so many pages that it gives VACUUM too much work.

A drawback is that this approach touches several access methods, so
out-of-tree access methods also likely need changes. Still, this is a good
backup if the first option measurably slows INSERT throughput.

> I can try my hand at such a patch if it looks like a good idea.

Please do.

> Thank you very much for your insights.

Thanks for the detailed report that made it possible.
