
Re: FSM Corruption (was: Could not read block at end of the relation)

From:	Noah Misch <noah(at)leadboat(dot)com>
To:	Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>
Cc:	pgsql-bugs <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject:	Re: FSM Corruption (was: Could not read block at end of the relation)
Date:	2024-03-03 23:47:15
Message-ID:	20240303234715.4d@rfd.leadboat.com
Lists:	pgsql-bugs

On Tue, Feb 27, 2024 at 11:34:14AM +0100, Ronan Dunklau wrote:
> - happens during heavy system load
> - lots of concurrent writes happening on a table
> - often (but haven't been able to confirm it is necessary), a vacuum is running
> on the table at the same time the error is triggered
>
> Then, several backends get the same error at once "ERROR: could not read
> block XXXX in file "base/XXXX/XXXX": read only 0 of 8192 bytes", with different

What are some of the specific block numbers reported?

> has anybody witnessed something similar ?

https://postgr.es/m/flat/CA%2BhUKGK%2B5DOmLaBp3Z7C4S-Yv6yoROvr1UncjH2S1ZbPT8D%2BZg%40mail.gmail.com
reminded me of this. Did you upgrade your OS recently?

On Fri, Mar 01, 2024 at 09:56:51AM +0100, Ronan Dunklau wrote:
> I think I may have missed something on my first look. On other affected
> clusters, the FSM is definitely corrupted. So it looks like we have an FSM
> corruption bug on our hands.

What corruption signs did you observe in the FSM? Since FSM is intentionally
not WAL-logged, corruption is normal, but corruption causing errors is not
normal. That said, if any crash leaves a state that the freespace/README
"self-correcting measures" don't detect, errors may happen. Did the clusters
crash recently?

> The occurrence of this bug happening makes it hard to reproduce, but it's
> definitely frequent enough we witnessed it on a dozen PostgreSQL clusters.

You could do "ALTER TABLE x SET (vacuum_truncate = off);" and see if the
problem stops happening. That would corroborate the VACUUM theory.
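
A minimal sketch of that experiment, assuming a placeholder table named x as
in the statement above. The reloptions check and the RESET are additions for
illustration, not steps from the original message:

```sql
-- Stop VACUUM from truncating empty pages at the end of the table.
-- "x" is a placeholder for the affected table.
ALTER TABLE x SET (vacuum_truncate = off);

-- Confirm the storage parameter was recorded for the table.
SELECT relname, reloptions
FROM pg_class
WHERE oid = 'x'::regclass;

-- Once the observation period is over, return to the default behaviour.
ALTER TABLE x RESET (vacuum_truncate);
```

If the "could not read block" errors stop while truncation is disabled and
return after the RESET, that points toward VACUUM truncation being involved.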

Can you use backtrace_functions to get a stack trace?
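
One way to set this up, sketched under an assumption: the function name
mdread is a guess at where the "could not read block" error is raised in the
release in use, and should be checked against that version's source before
relying on it. Changing this setting requires superuser privileges.

```sql
-- Assumption: the error is raised inside mdread(); replace the name if the
-- source for your release reports the error from a different C function.
ALTER SYSTEM SET backtrace_functions = 'mdread';
SELECT pg_reload_conf();

-- ... wait for the error to recur; the backtrace is written to the server log ...

-- Remove the setting once a trace has been captured.
ALTER SYSTEM RESET backtrace_functions;
SELECT pg_reload_conf();
```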

> In our case, we need to repair the FSM. The instructions on the wiki do work,
> but maybe we should add something like the attached patch (modeled after the
> same feature in pg_visibility) to make it possible to repair the FSM
> corruption online. What do you think about it ?

That's reasonable in concept.

In response to

FSM Corruption (was: Could not read block at end of the relation) at 2024-03-01 08:56:51 from Ronan Dunklau

Responses

Re: FSM Corruption (was: Could not read block at end of the relation) at 2024-03-04 13:10:39 from Ronan Dunklau
