From: Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-bugs <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: FSM Corruption (was: Could not read block at end of the relation)
Date: 2024-03-04 13:10:39
Message-ID: 1925490.taCxCBeP46@aivenlaptop
Lists: pgsql-bugs

On Monday, March 4, 2024 at 00:47:15 CET, Noah Misch wrote:
> On Tue, Feb 27, 2024 at 11:34:14AM +0100, Ronan Dunklau wrote:
> > - happens during heavy system load
> > - lots of concurrent writes happening on a table
> > - often (but haven't been able to confirm it is necessary), a vacuum is
> > running on the table at the same time the error is triggered
> >
> > Then, several backends get the same error at once "ERROR: could not read
> > block XXXX in file "base/XXXX/XXXX": read only 0 of 8192 bytes", with
> > different
> What are some of the specific block numbers reported?

Nothing specific, except that they are past the end of the table, by an offset
starting at one. I've seen it happen quite a lot on regular tables restored by
pg_restore, but also on bigger tables, both regular and TOAST. We've seen this
on a dozen different clusters, but have nothing reproducible for now.
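
For what it's worth, a quick sanity check along these lines (the table name is a
placeholder) is how I confirm that the failing block is past the current end of
the relation:

-- compare nblocks with the block number reported in the error message
SELECT pg_relation_size('corrupted_table') /
       current_setting('block_size')::bigint AS nblocks;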

>
> > has anybody witnessed something similar ?
>
> https://postgr.es/m/flat/CA%2BhUKGK%2B5DOmLaBp3Z7C4S-Yv6yoROvr1UncjH2S1ZbPT8
> D%2BZg%40mail.gmail.com reminded me of this. Did you upgrade your OS
> recently?

No recent upgrades I'm aware of. Also, it only affects PG16 clusters, not
earlier versions. But it does look like the same kind of error.

> What corruption signs did you observe in the FSM? Since FSM is
> intentionally not WAL-logged, corruption is normal, but corruption causing
> errors is not normal. That said, if any crash leaves a state that the
> freespace/README "self-correcting measures" don't detect, errors may
> happen. Did the clusters crash recently?

Well, we still have full-page images for the FSM.
I'll take the cluster I'm currently investigating as an example.
The corruption I'm seeing is an additional entry in the FSM for a block that
does not exist in the relation: IIUC, slot 4129 corresponds to block 34, which
is the last one I have in the relation. However, I also have an entry (slot
4130) for block number 35, tagged with MaxFSMRequestSize:

fsm_page_contents
-------------------
0: 255 +
1: 255 +
3: 255 +
...
4129: 114 +
4130: 255 +

This then causes attempts to insert into that block to fail.
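
(For reference, the leaf page above was dumped with pageinspect; the relation
name below is a placeholder, and 2 is the FSM leaf block from the trace. With
the default 8kB block size, leaf slots start at 4095, so slot 4095 maps to heap
block 0 and slot 4130 to block 35.)

CREATE EXTENSION IF NOT EXISTS pageinspect;
-- 'corrupted_table' is a placeholder for the affected relation
SELECT fsm_page_contents(get_raw_page('corrupted_table', 'fsm', 2));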

Looking at when the corruption was WAL-logged, this particular case is quite
easy to trace. We have a few MULTI_INSERT+INIT records initially loading the
table (probably a pg_restore), then, 2GB of WAL later, what looks like a VACUUM
running on the table: a succession of FPI_FOR_HINT, FREEZE_PAGE and VISIBLE
xlog records for each block of the relation's main fork, followed by a lone FPI
for the leaf page of its FSM:

rmgr: Heap2 len (rec/tot): 5698/ 5698, tx: 1637, lsn:
1/810763B8, prev 1/81076388, desc: MULTI_INSERT+INIT ntuples: 124, flags: 0x08,
blkref #0: rel 1663/17343/17959 blk 0
rmgr: XLOG len (rec/tot): 49/ 8241, tx: 1637, lsn:
1/81077A00, prev 1/810763B8, desc: FPI_FOR_HINT , blkref #0: rel
1663/17343/17959 fork fsm blk 2 FPW
rmgr: Heap2 len (rec/tot): 5762/ 5762, tx: 1637, lsn:
1/81079A50, prev 1/81077A00, desc: MULTI_INSERT+INIT ntuples: 121, flags: 0x08,
blkref #0: rel 1663/17343/17959 blk 1
rmgr: Heap2 len (rec/tot): 5746/ 5746, tx: 1637, lsn:
1/8107B0F0, prev 1/81079A50, desc: MULTI_INSERT+INIT ntuples: 121, flags: 0x08,
blkref #0: rel 1663/17343/17959 blk 2
rmgr: Heap2 len (rec/tot): 5778/ 5778, tx: 1637, lsn:
1/8107C780, prev 1/8107B0F0, desc: MULTI_INSERT+INIT ntuples: 121, flags: 0x08,
blkref #0: rel 1663/17343/17959 blk 3

(and so on, up until blk 34 is WAL-logged)

At some point FSM pages 0 and 1 get an FPI, but the leaf page (blk 2) does not
during the transaction.

Later on, we have this for every main fork block:

rmgr: XLOG len (rec/tot): 49/ 8201, tx: 0, lsn:
2/0E1ECEC8, prev 2/0E1ECE10, desc: FPI_FOR_HINT , blkref #0: rel
1663/17343/17959 blk 0 FPW
rmgr: Heap2 len (rec/tot): 325/ 325, tx: 0, lsn:
2/0E1EEEF0, prev 2/0E1ECEC8, desc: FREEZE_PAGE snapshotConflictHorizon: 1637,
nplans: 2, plans: [{ xmax: 0, infomask: 2816, infomask2: 5, ntuples: 86,
offsets: [1, 2, 3, 5, 7, 10, 14, 16, 21, 22, 28, 30, 31, 32, 33, 34, 35, 36,
37, 38, 39, 40, 41, 42, 45, 46, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 60,
64, 65, 67, 68, 69, 71, 72, 73, 74, 75, 76, 77, 78, 79, 81, 82, 83, 85, 86,
87, 90, 92, 94, 95, 96, 98, 100, 102, 103, 104, 105, 106, 107, 108, 109, 110,
111, 112, 113, 114, 115, 117, 118, 119, 120, 121, 122, 123, 124] }, { xmax: 0,
infomask: 2817, infomask2: 5, ntuples: 38, offsets: [4, 6, 8, 9, 11, 12, 13,
15, 17, 18, 19, 20, 23, 24, 25, 26, 27, 29, 43, 44, 47, 48, 59, 61, 62, 63,
66, 70, 80, 84, 88, 89, 91, 93, 97, 99, 101, 116] }], blkref #0: rel
1663/17343/17959 blk 0
rmgr: Heap2 len (rec/tot): 64/ 8256, tx: 0, lsn:
2/0E1EF038, prev 2/0E1EEEF0, desc: VISIBLE snapshotConflictHorizon: 0, flags:
0x03, blkref #0: rel 1663/17343/17959 fork vm blk 0 FPW, blkref #1: rel
1663/17343/17959 blk 0

And then the single FSM FPI, which contains the invalid record:

rmgr: XLOG len (rec/tot): 49/ 8241, tx: 0, lsn:
2/0E28F640, prev 2/0E28F600, desc: FPI_FOR_HINT , blkref #0: rel
1663/17343/17959 fork fsm blk 2 FPW

I've dumped the associated FPIs and taken a look at them. The first FPI for the
leaf block (#2 in the FSM) happens at the beginning of the multi-insert
transaction, and only contains one leaf node, for block 0 (slot 4095).
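
(In case anyone wants to repeat this analysis: a dumped full-page image can be
decoded with pageinspect, provided it is a raw BLCKSZ-sized page, for instance
one written out by pg_waldump --save-fullpage; the file path below is only a
placeholder.)

-- decode a saved FPI of the FSM leaf page (path is a placeholder)
SELECT fsm_page_contents(pg_read_binary_file('/path/to/saved_fsm_leaf_page'));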

The later FPI (during what I think is a VACUUM) then persists the corruption in
the WAL.
So presumably something happened during the COPY operation which caused it to
add an invalid entry for the nonexistent block 35 to the FSM.

There are no traces of relation truncation happening in the WAL.

This case only shows a single invalid entry in the FSM, but I've seen as many
as 62 blocks present in the FSM that do not exist on disk, all tagged with
MaxFSMRequestSize, so I suspect something is wrong with the bulk extension
mechanism.

>
> > The occurence of this bug happening makes it hard to reproduce, but it's
> > definitely frequent enough we witnessed it on a dozen PostgreSQL clusters.
>
> You could do "ALTER TABLE x SET (vacuum_truncate = off);" and see if the
> problem stops happening. That would corroborate the VACUUM theory.
>
> Can you use backtrace_functions to get a stack track?

I got a stack trace from the error after the corruption had occurred (I even
got to attach gdb to it, as I was able to restore a corrupted cluster), but I
don't think it's of much use (this is from a different occurrence than the one
I described in detail above, so the block numbers don't match, but it's the
exact same thing).

#0 errstart (elevel=21, domain=0x0) at utils/error/elog.c:348
#1 0x000055c81ea83a0a in errstart_cold (domain=0x0, elevel=21) at utils/error/elog.c:333
#2 mdread (reln=0x55c8210bbe00, forknum=MAIN_FORKNUM, blocknum=1203, buffer=0x7f51b4c32000) at storage/smgr/md.c:796
#3 0x000055c81ee396cf in smgrread (buffer=0x7f51b4c32000, blocknum=1203, forknum=MAIN_FORKNUM, reln=0x55c8210bbe00) at storage/smgr/smgr.c:565
#4 ReadBuffer_common (smgr=0x55c8210bbe00, relpersistence=relpersistence@entry=112 'p', forkNum=forkNum@entry=MAIN_FORKNUM, blockNum=blockNum@entry=1203, mode=mode@entry=RBM_NORMAL, strategy=<optimized out>, hit=0x7fff6e22f607) at storage/buffer/bufmgr.c:1124
#5 0x000055c81ee39d0b in ReadBufferExtended (reln=0x7f51b294f0d0, forkNum=MAIN_FORKNUM, blockNum=1203, mode=RBM_NORMAL, strategy=<optimized out>) at access/brin/../../../../src/include/utils/rel.h:576
#6 0x000055c81eb1525c in ReadBufferBI (mode=RBM_NORMAL, bistate=0x0, targetBlock=1203, relation=<optimized out>) at access/heap/hio.c:96
#7 RelationGetBufferForTuple (relation=<optimized out>, len=<optimized out>, otherBuffer=0, options=<optimized out>, bistate=0x0, vmbuffer=<optimized out>, vmbuffer_other=<optimized out>, num_pages=<optimized out>) at access/heap/hio.c:620
#8 0x000055c81eafcace in heap_insert (relation=0x7f51b294f0d0, tup=0x55c821042580, cid=<optimized out>, options=0, bistate=0x0) at access/heap/heapam.c:1862
#9 0x000055c81eb08bab in heapam_tuple_insert (relation=0x7f51b294f0d0, slot=0x55c821042478, cid=0, options=0, bistate=0x0) at access/heap/heapam_handler.c:252
#10 0x000055c81ecce2f1 in table_tuple_insert (bistate=0x0, options=0, cid=<optimized out>, slot=0x55c821042478, rel=<optimized out>) at executor/../../../src/include/access/tableam.h:1400
#11 ExecInsert (context=context@entry=0x7fff6e22f9b0, resultRelInfo=resultRelInfo@entry=0x55c8210410c0, slot=slot@entry=0x55c821042478, canSetTag=<optimized out>, inserted_tuple=inserted_tuple@entry=0x0, insert_destrel=insert_destrel@entry=0x0) at executor/nodeModifyTable.c:1133
#12 0x000055c81eccf1eb in ExecModifyTable (pstate=<optimized out>) at executor/nodeModifyTable.c:3810

Just a regular insertion failing to read the block it's trying to insert into.

> > In our case, we need to repair the FSM. The instructions on the wiki do
> > work, but maybe we should add something like the attached patch (modeled
> > after the same feature in pg_visibility) to make it possible to repair
> > the FSM corruption online. What do you think about it ?
>
> That's reasonable in concept.

Ok, I'll start a separate thread on -hackers for that.
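
For the record, the idea would be to expose something in contrib/pg_freespacemap
mirroring pg_visibility's pg_truncate_visibility_map(); the function name below
is hypothetical until that thread settles on one, and 'corrupted_table' is a
placeholder:

-- hypothetical repair function: throw away the corrupted FSM fork...
SELECT pg_truncate_freespace_map('corrupted_table');
-- ...and let VACUUM rebuild it
VACUUM corrupted_table;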

Thank you!
