Re: FSM Corruption (was: Could not read block at end of the relation)

From: Noah Misch <noah(at)leadboat(dot)com>
To: Ronan Dunklau <ronan(dot)dunklau(at)aiven(dot)io>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-bugs <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: FSM Corruption (was: Could not read block at end of the relation)
Date: 2024-03-16 04:58:34
Message-ID: 20240316045834.83@rfd.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Fri, Mar 15, 2024 at 08:03:39PM -0700, Noah Misch wrote:
> On Fri, Mar 15, 2024 at 07:23:46AM +0100, Ronan Dunklau wrote:
> > Le mercredi 13 mars 2024, 17:55:23 CET Noah Misch a écrit :
> > > On Wed, Mar 13, 2024 at 10:37:21AM +0100, Ronan Dunklau wrote:
> > > > I'm a bit worried about triggering additional lseeks
> > > > when we in fact have other free pages in the map, that would pass the
> > > > test. I'm not sure how relevant it is given the way we search the FSM
> > > > with fp_next_slot though...
> > >
> > > That's a reasonable thing to worry about. We could do wrong by trying too
> > > hard to use an FSM slot, and we could do wrong by not trying hard enough.
> > >
> > > > To address that, I've given a bit of thought about enabling / disabling
> > > > the
> > > > auto-repair behaviour with a flag in GetPageWithFreeSpace to distinguish
> > > > the cases where we know we have a somewhat up-to-date value compared to
> > > > the case where we don't (as in, for heap, try without repair, then get an
> > > > uptodate value to try the last block, and if we need to look at the FSM
> > > > again then ask for it to be repaired) but it brings way too much
> > > > complexity and would need careful thought for each AM.
> > > >
> > > > So please find attached a patch with the change you propose.
> > > >
> > > > Do you have a link to the benchmark you mention though to evaluate the
> > > > patch against it ?
> > >
> > > Here is one from the thread that created commit 719c84c:
> > > https://www.postgresql.org/message-id/flat/CA%2BTgmob7xED4AhoqLspSOF0wCMYEom
> > > gHfuVdzNJnwWVoE_c60g%40mail.gmail.com
> > >
> > > There may be other benchmarks earlier in that thread.
> >
> > Thank you. I tried to run that benchmark, with 4 clients all copying 10M
> > pgbench_accounts row concurrently. With and without the patch, I ran the
> > benchmark 5 times, and took the average.
>
> Does "that benchmark" refer to "unlogged tables, 4 parallel copies", "logged
> tables, 4 parallel copies", or something else?
>
> > master: 15.05s
> > patched: 15.24s (+1.25%)
> >
> > If I remove the best and worst run for each of those, the difference falls at
> > +0.7%.
>
> To get some additional perspective on the benchmark, how hard would it be to
> run one or both of the following? Feel free to decline if difficult.
>
> - Make GetPageWithFreeSpace() just "return InvalidBlockNumber", then rerun the
> benchmark. This should be a lot slower. If not, the bottleneck is
> somewhere unexpected, and we'd need a different benchmark.
>
> - Get profiles with both master and patched. (lseek or freespace.c functions
> rising by 0.1%-1% would fit what we know.)

Forgot one more:

- Your earlier version of the patch, with fewer lseek() but more disuse of FSM
entries.

> Distinguishing a 1% change from a 0% change would need different techniques
> still. If we're genuinely slowing bulk loads by ~1% to account for the
> possibly of a flawed post-recovery FSM, that's sad, but I'm inclined to accept
> the loss. A persistent error in an innocent INSERT is unacceptable, and the
> alternatives we discussed upthread have their own substantial trouble. Other
> opinions?
>
> Thanks,
> nm

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Amit Kapila 2024-03-16 11:51:59 Re: Re:RE: Re:BUG #18369: logical decoding core on AssertTXNLsnOrder()
Previous Message Noah Misch 2024-03-16 03:03:39 Re: FSM Corruption (was: Could not read block at end of the relation)