Re: WAL record CRC calculated incorrectly because of underlying buffer modification

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Michael Paquier <michael(at)paquier(dot)xyz>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, Jeff Davis <pgsql(at)j-davis(dot)com>
Subject: Re: WAL record CRC calculated incorrectly because of underlying buffer modification
Date: 2024-05-12 23:15:03
Message-ID: CA+hUKG+=cb86CYa4W42z4wFBMwjQE2=O9RFC+i4QZuCB+d2p0A@mail.gmail.com
Lists: pgsql-hackers

On Sat, May 11, 2024 at 5:00 PM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
> 11.05.2024 07:25, Thomas Munro wrote:
> > On Sat, May 11, 2024 at 4:00 PM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
> >> 11.05.2024 06:26, Thomas Munro wrote:
> >>> Perhaps a no-image, no-change registered buffer should not be
> >>> including an image, even for XLR_CHECK_CONSISTENCY? It's actually
> >>> useless for consistency checking too I guess, this issue aside,
> >>> because it doesn't change anything so there is nothing to check.

> >> Yes, I think something is wrong here. I've reduced the reproducer to:

> > Does it reproduce if you do this?
> >
> > - include_image = needs_backup || (info & XLR_CHECK_CONSISTENCY) != 0;
> > + include_image = needs_backup ||
> > +     ((info & XLR_CHECK_CONSISTENCY) != 0 &&
> > +      (regbuf->flags & REGBUF_NO_CHANGE) == 0);
>
> No, it doesn't (at least with the latter, more targeted reproducer).

OK so that seems like a candidate fix, but ...

> > Unfortunately the back branches don't have that new flag from 00d7fb5e
> > so, even if this is the right direction (not sure, I don't understand
> > this clean registered buffer trick) then ... but wait, why are there
> > no failures like this in the back branches (yet at least)? Does
> > your reproducer work for 16? I wonder if something relevant changed
> > recently, like f56a9def. CC'ing Michael and Amit K for info.
>
> Maybe it's hard to hit (autovacuum needs to process the index page in a
> narrow time frame), but locally I could reproduce the issue even on
> ac27c74de(~1 too) from 2018-09-06 (I tried several last commits touching
> hash indexes, didn't dig deeper).

... we'd need to figure out how to fix this in the back-branches too.
One idea would be to back-patch REGBUF_NO_CHANGE, and another might be
to deduce that case from other variables. Let me CC a couple more
people from this thread, which most recently hacked on this stuff, to
see if they have insights:

https://www.postgresql.org/message-id/flat/d2c31606e6bb9b83a02ed4835d65191b38d4ba12.camel%40j-davis.com
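
For anyone following along, here is a rough standalone sketch of the two
conditions being compared, just to make the difference concrete. It is not
the actual WAL record assembly code: the function names are hypothetical,
and the bit values for XLR_CHECK_CONSISTENCY and REGBUF_NO_CHANGE are
placeholders chosen for illustration (the real definitions live in the
PostgreSQL headers). The only case where the two rules should differ is a
consistency-check record whose buffer was registered as unchanged and does
not otherwise need a backup image.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit values, for illustration only (assumption). */
#define XLR_CHECK_CONSISTENCY 0x02
#define REGBUF_NO_CHANGE      0x20

/* Rule on the '-' line of the quoted diff (hypothetical wrapper). */
bool
include_image_current(bool needs_backup, uint8_t info)
{
    return needs_backup || (info & XLR_CHECK_CONSISTENCY) != 0;
}

/* Rule on the '+' lines: skip the consistency image for no-change buffers. */
bool
include_image_candidate(bool needs_backup, uint8_t info, uint8_t regbuf_flags)
{
    return needs_backup ||
        ((info & XLR_CHECK_CONSISTENCY) != 0 &&
         (regbuf_flags & REGBUF_NO_CHANGE) == 0);
}

/* Walk the interesting cases; aborts if the sketch misbehaves. */
void
check_include_image_rules(void)
{
    /* A needed backup image is always included, whatever the flags say. */
    assert(include_image_candidate(true, 0, REGBUF_NO_CHANGE));

    /* Consistency check on a buffer that did change: image still included. */
    assert(include_image_candidate(false, XLR_CHECK_CONSISTENCY, 0));

    /* The one case that differs: consistency check on a no-change buffer. */
    assert(include_image_current(false, XLR_CHECK_CONSISTENCY));
    assert(!include_image_candidate(false, XLR_CHECK_CONSISTENCY,
                                    REGBUF_NO_CHANGE));
}
```

Calling check_include_image_rules() exercises the cases above. A back-branch
variant that deduces the no-change case from other variables would need a
different third argument; this sketch does not attempt that.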
