Re: FSM Corruption (was: Could not read block at end of the relation)

From: Noah Misch <noah(at)leadboat(dot)com>
To: Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-bugs <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: FSM Corruption (was: Could not read block at end of the relation)
Date: 2024-03-12 22:08:07
Message-ID: 20240312220807.05.nmisch@google.com
Lists: pgsql-bugs

On Thu, Mar 07, 2024 at 12:21:24PM +0100, Ronan Dunklau wrote:
> On Wednesday, 6 March 2024 19:42:28 CET, you wrote:
> > On Wed, Mar 06, 2024 at 10:31:17AM +0100, Ronan Dunklau wrote:
> > > On Tuesday, 5 March 2024, 00:05:03 CET, Noah Misch wrote:
> > > > I would guess this one is more risky from a performance perspective, since
> > > > we'd be adding to a hotter path under RelationGetBufferForTuple(). Still,
> > > > it's likely fine.
> > >
> > > I ended up implementing this in the attached patch. The idea is that we
> > > detect if the FSM returns a page past the end of the relation, and ignore
> > > it. In that case we will fall back to the extension mechanism.
> > >
> > > For the corrupted-FSM case it is not great performance-wise, as we will
> > > extend the relation in small steps every time we find a nonexistent
> > > block in the FSM, until the actual relation size matches what is recorded
> > > in the FSM. But since those cases seldom happen, I figured it was better
> > > to keep the code really simple for a bugfix.
> >
> > That could be fine. Before your message, I assumed we would treat this like
> > the case of a full page. That is, we'd call
> > RecordAndGetPageWithFreeSpace(), which would remove the FSM entry and
> > possibly return another. That could be problematic if another backend is
> > extending the relation; we don't want to erase the extender's new FSM
> > entry. I'm parking this topic for the moment.
>
> I would argue this is fine, as corruption doesn't happen often, and the FSM is
> not an exact thing anyway. If we record a block as full, it will just be unused
> until the next vacuum adds it back to the FSM with a correct value.

If this happened only under FSM corruption, it would be fine. I was thinking
about normal extension, with an event sequence like this:

- RelationExtensionLockWaiterCount() returns 10.
- Backend B1 extends by 10 pages.
- B1 records 9 pages in the FSM.
- B2, B3, ... B11 wake up and fetch from the FSM. In each of those backends,
  fsm_does_block_exist() returns false, because the cached relation size is
  stale. Those pages remain unused until VACUUM re-records them in the FSM.

PostgreSQL added the multiple-block extension logic (commit 719c84c) after
hard-won experience with that bottleneck. If fixing $SUBJECT loses 1% of
that change's benefit, that's probably okay. If it loses 20%, we should work
to do better. While we could rerun the 2016 benchmark to see how much the
patch regresses it, there might be a simple fix. fsm_does_block_exist() could
skip the cache when blknumber >= cached, like this:

    return ((BlockNumberIsValid(smgr->smgr_cached_nblocks[MAIN_FORKNUM]) &&
             blknumber < smgr->smgr_cached_nblocks[MAIN_FORKNUM]) ||
            blknumber < RelationGetNumberOfBlocks(rel));

How do you see it? That still leaves us with an avoidable lseek in B2, B3,
... B11, but that feels tolerable. I don't know how to do better without
switching to the ReadBufferMode approach.

> > The patch currently adds an lseek() whenever the FSM finds a block. I see
> > relevant-looking posts from mailing list searches with subsets of these
> > terms: RelationGetNumberOfBlocks lseek benchmark overhead. I bet we should
> > reduce this added cost. Some ideas:
> >
> > - No need for a system call if the FSM-returned value is low enough with
> >   respect to any prior RelationGetNumberOfBlocks() call.
>
> I'm not sure what you mean by "low enough". What I chose to implement instead

I just meant "less than". Specifically, if "fsm_returned_value < MAX(all
RelationGetNumberOfBlocks() calls done on this relation, since last postmaster
start)" then the FSM-returned value won't have the problem seen in $SUBJECT.

> > - We'd be happy and efficient with a ReadBufferMode such that
> > ReadBufferExtended() returns NULL after a 0-byte read, with all other
> > errors handled normally. That sounds like the specification of
> > RBM_NORMAL_NO_LOG. Today's implementation of RBM_NORMAL_NO_LOG isn't that,
> > but perhaps one RBM could satisfy both sets of needs.
>
> I'm not sure about this one; it seems to have much more far-reaching
> consequences. It would mean we don't have any cooperation from the FSM, and
> would need to implement error recovery scenarios in each caller. But I admit I
> haven't looked too much into it, and focused instead on seeing whether the
> current approach can get us to an acceptable state.

I, too, am not sure. I agree that the RBM approach wouldn't let you isolate
the changes like the last patch does. Each GetFreeIndexPage() caller would need
code for the return-NULL case. Avoiding that is nice. (If we ever do the RBM
change in the future, the fsm_readbuf() and vm_readbuf() !extend cases should
also use it.)
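
To make that concrete, a caller's retry loop under that hypothetical mode
could look roughly like this (a sketch only; I'm using RBM_NORMAL_NO_LOG as a
stand-in for whatever mode ends up returning InvalidBuffer after a 0-byte
read, which is not its behavior today):

    for (;;)
    {
        BlockNumber blkno = GetFreeIndexPage(rel);

        if (blkno == InvalidBlockNumber)
            break;              /* FSM is empty; extend the relation */

        /* hypothetical semantics: InvalidBuffer after a 0-byte read */
        buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                 RBM_NORMAL_NO_LOG, NULL);
        if (!BufferIsValid(buf))
        {
            /* FSM pointed past EOF; forget the entry and retry */
            RecordUsedIndexPage(rel, blkno);
            continue;
        }
        /* ... use the buffer ... */
    }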

> --- a/src/backend/storage/freespace/freespace.c
> +++ b/src/backend/storage/freespace/freespace.c

> +            if (addr.level == FSM_BOTTOM_LEVEL)
> +            {
> +                BlockNumber blkno = fsm_get_heap_blk(addr, slot);
> +                Page        page;
> +
> +                /*
> +                 * Return the block only if it actually exists; otherwise
> +                 * zero the slot and restart the search from the root.
> +                 */
> +                if (fsm_does_block_exist(rel, blkno))
> +                {
> +                    ReleaseBuffer(buf);
> +                    return blkno;
> +                }
> +                page = BufferGetPage(buf);
> +                LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
> +                fsm_set_avail(page, slot, 0);
> +                MarkBufferDirtyHint(buf, false);

I wondered if we should zero all slots on the same FSM page that correspond to
main fork blocks past the end of the main fork. The usual "advancenext"
behavior is poor for this case, since every later slot is invalid in the same way.
My current thinking: no, it's not worth the code bulk to do better. A single
extension caps at 64 blocks, so the worst case is roughly locking the buffer
64 times and restarting from FSM root 64 times. For something this
infrequent, that's not bad.
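
For the record, the more thorough variant I'm declining would amount to
roughly this inside the FSM_BOTTOM_LEVEL branch of the patch (sketch only,
not a proposal):

    /* zero every slot on this FSM page whose heap block is past EOF */
    for (int s = slot; s < LeafNodesPerPage; s++)
    {
        if (!fsm_does_block_exist(rel, fsm_get_heap_blk(addr, s)))
            fsm_set_avail(page, s, 0);
    }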

Thanks,
nm
