Re: BUG #17928: Standby fails to decode WAL on termination of primary

From: Noah Misch <noah(at)leadboat(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, thomas(dot)munro(at)gmail(dot)com, exclusion(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17928: Standby fails to decode WAL on termination of primary
Date: 2023-08-11 14:00:08
Message-ID: 20230811140008.GB2261449@rfd.leadboat.com
Lists: pgsql-bugs

On Fri, Aug 11, 2023 at 03:08:26PM +0900, Michael Paquier wrote:
> On Thu, Aug 10, 2023 at 07:58:08PM -0700, Noah Misch wrote:
> > On Thu, Aug 10, 2023 at 04:45:25PM +0900, Michael Paquier wrote:
> >> Good idea to pollute the data with recycled segments. Using a minimal
> >> WAL segment size would help here as well in keeping a test cheap, and
> >> two segments should be enough. The alignment calculations and the
> >> header size can be known, but the standby records are an issue for the
> >> predictability of the test when it comes to adjusting the length of the
> >> logical message depending on the 8k WAL page, no?
> >
> > Could be. I expect there would be challenges translating that outline into a
> > real test, but I don't know if that will be a major one. The test doesn't
> > need to be 100% deterministic. If it fails 25% of the time and is not the
> > slowest test in the recovery suite, I'd find that good enough.
>
> FWIW, I'm having a pretty hard time getting something close enough to a
> page border in a reliable way. Perhaps using a larger series of
> records and selecting only one would be more reliable. Need to test
> that a bit more.

Interesting. So pg_logical_emit_message(false, 'X', repeat('X', n)) doesn't
get close enough, but s/n/n+1/ spills to the next page? If so, I did not
anticipate that.
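
To make the arithmetic behind that question concrete, here is a rough sketch,
not taken from any patch in this thread. It assumes the default 8k WAL page,
8-byte MAXALIGN, and a made-up fixed per-record overhead (the 50 below is
hypothetical; the real header sizes would have to be checked against the
sources). It also ignores page headers, records longer than a page, and any
standby records that land in between, which is the unpredictable part.

```python
PAGE_SIZE = 8192  # default WAL page size (the "8k WAL page" above)
MAXALIGN = 8      # assumed alignment of record start positions


def maxalign(n):
    """Round n up to the next multiple of MAXALIGN."""
    return (n + MAXALIGN - 1) & ~(MAXALIGN - 1)


def next_record_offset(start, overhead, n):
    """Page offset where the record after ours would start.

    start    -- aligned page offset where our record begins
    overhead -- hypothetical fixed header bytes per record (an assumption)
    n        -- payload length, as in repeat('X', n)
    """
    return start + maxalign(overhead + n)


def payloads_leaving_gap(start, overhead, gap):
    """Payload lengths for which the next record starts `gap` bytes
    before the end of the page. Empty if the gap is unreachable."""
    target = PAGE_SIZE - gap - start  # aligned record length we need
    if target <= 0 or target % MAXALIGN:
        return []
    low = target - MAXALIGN + 1 - overhead
    high = target - overhead
    return [n for n in range(low, high + 1) if n >= 0]
```

Under those assumptions, up to eight consecutive values of n leave the same
gap, and going from the largest of them to n+1 moves the next record's start
by 8 bytes rather than a whole page.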

> >> FWIW, I came back to this thread while tweaking the error reporting of
> >> xlogreader.c for the sake of this thread and this proposal is an
> >> improvement to be able to make a distinction between an OOM and an
> >> incorrect record:
> >> https://www.postgresql.org/message-id/ZMh/WV+CuknqePQQ(at)paquier(dot)xyz
> >>
> >> Anyway, agreed that it's an improvement to remove this check out of
> >> allocate_recordbuf(). Noah, are you planning to work more on that?
> >
> > I can push xl_tot_len-validate-v1.patch, particularly given the testing result
> > you reported today. I'm content for my part to stop there.
>
> Okay, fine by me. That's going to help with what I am doing in the
> other thread as I'd need to better distinguish between the OOM
> and the invalid cases for the allocation path.
>
> You are planning a backpatch to take care of the inconsistency,
> right? The report has been on 15~ where the prefetching was
> introduced. I'd be OK to just do that and not mess with the stable
> branches more than necessary (aka ~14) if nobody complains, especially
> with REL_11_STABLE planned to be EOL'd in the next minor cycle.

I recall earlier messages theorizing that it was just harder to hit in v14, so
I'm disinclined to stop at v15. I think the main choice is whether to stop at
v11 (normal choice) or v12 (worry about breaking the last v11 point release).
I don't have a strong opinion between those.
