Quick Links

Re: FSM Corruption (was: Could not read block at end of the relation)

From:	Noah Misch <noah(at)leadboat(dot)com>
To:	Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>
Cc:	Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-bugs <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject:	Re: FSM Corruption (was: Could not read block at end of the relation)
Date:	2024-03-06 18:42:28
Message-ID:	20240306184228.a7.nmisch@google.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-bugs

On Wed, Mar 06, 2024 at 10:31:17AM +0100, Ronan Dunklau wrote:
> Le mardi 5 mars 2024, 00:05:03 CET Noah Misch a écrit :
> > I would guess this one is more risky from a performance perspective, since
> > we'd be adding to a hotter path under RelationGetBufferForTuple(). Still,
> > it's likely fine.
>
> I ended up implementing this in the attached patch. The idea is that we detect
> if the FSM returns a page past the end of the relation, and ignore it.
> In that case we will fallback through the extension mechanism.
>
> For the corrupted-FSM case it is not great performance wise, as we will extend
> the relation in small steps every time we find a non existing block in the FSM,
> until the actual relation size matches what is recorded in the FSM. But since
> those seldom happen, I figured it was better to keep the code really simple for
> a bugfix.

That could be fine. Before your message, I assumed we would treat this like
the case of a full page. That is, we'd call RecordAndGetPageWithFreeSpace(),
which would remove the FSM entry and possibly return another. That could be
problematic if another backend is extending the relation; we don't want to
erase the extender's new FSM entry. I'm parking this topic for the moment.

> I wanted to test the impact in terms of performance, and I thought about the
> worst possible case for this.
>
[test details]
>
> I noticed no difference running with or without the patch, but maybe someone
> else can try to run that or find another adversarial case ?

The patch currently adds an lseek() whenever the FSM finds a block. I see
relevant-looking posts from mailing list searches with subsets of these terms:
RelationGetNumberOfBlocks lseek benchmark overhead. I bet we should reduce
this cost add. Some ideas:

- No need for a system call if the FSM-returned value is low enough with
respect to any prior RelationGetNumberOfBlocks() call.

- We'd be happy and efficient with a ReadBufferMode such that
ReadBufferExtended() returns NULL after a 0-byte read, with all other errors
handled normally. That sounds like the specification of RBM_NORMAL_NO_LOG.
Today's implementation of RBM_NORMAL_NO_LOG isn't that, but perhaps one RBM
could satisfy both sets of needs.

On Wed, Mar 06, 2024 at 12:08:38PM +0100, Ronan Dunklau wrote:
> Subject: [PATCH 2/2] Detect invalid FSM when finding a block.
>
> Whenever the FSM points to a block past the end of the relation, detect
> it and fallback to the relation extension mechanism.
>
> This may arise when a FPI for the FSM has been triggered, and we end up
> WAL-logging a page of the FSM pointing to newly extended blocks in the
> relation which have never been WAL-logged.
> ---
> src/backend/access/heap/hio.c | 20 +++++++++++++++++++-

Each GetPageWithFreeSpace() caller would need this, not just heap. Please
evaluate whether this logic belongs inside freespace.c vs. in each caller.
freespace.c comments and/or freespace/README may need updates. Please
evaluate header comments of freespace.c functions whose callers you are
changing, and evaluate comments about truncation.

In response to

Re: FSM Corruption (was: Could not read block at end of the relation) at 2024-03-06 09:31:17 from Ronan Dunklau

Responses

Re: FSM Corruption (was: Could not read block at end of the relation) at 2024-03-07 11:21:24 from Ronan Dunklau

Browse pgsql-bugs by date

	From	Date	Subject
Next Message	Tom Lane	2024-03-06 19:52:28	Re: BUG #18379: LDAP bind password exposed
Previous Message	Stephen Frost	2024-03-06 18:40:26	Re: BUG #18379: LDAP bind password exposed