Re: BUG #17928: Standby fails to decode WAL on termination of primary

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, Sergei Kornilov <sk(at)zsrv(dot)org>, Noah Misch <noah(at)leadboat(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org, Fujii Masao <masao(dot)fujii(at)oss(dot)nttdata(dot)com>
Subject: Re: BUG #17928: Standby fails to decode WAL on termination of primary
Date: 2023-09-23 20:48:42
Message-ID: CA+hUKGJf3Hhb2MB88-rW2di2H9XT0xr6-hd6ZjGEwdJs3A=b+Q@mail.gmail.com
Lists: pgsql-bugs

On Sat, Sep 23, 2023 at 4:44 PM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> The stack may point out at a different issue, but perhaps this is a
> matter where we're returning now XLREAD_SUCCESS where previously we
> had XLREAD_FAIL, causing this code to fail thinking that the block was
> valid while it's not?

"grison" has a little more detail -- we see
pg_comp_crc32c_sb8(len=4294636456). I'm wondering how to reproduce
this, but among the questions that jump out I have: why was it ever OK
that we load record->xl_tot_len into total_len, perform header
validation, determine that total_len < len (= this record is all on
one page, no reassembly loop needed, so now we're in the single-page
branch), then call ReadPageInternal() again, then call
ValidXLogRecord() which internally loads record->xl_tot_len *again*?
ReadPageInternal() might have changed xl_tot_len, no? That seems to
be a possible pathway to reading past the end of the buffer in the CRC
check, no?
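
To make the suspected sequence concrete, here's a heavily
simplified, self-contained sketch of the hazard as I understand it.
None of these names (FakeRecordHeader, fake_page, read_page_again)
are the real xlogreader.c ones; only the shape of the problem is:
the length is validated once, the page buffer may be refilled, and
the CRC step then re-loads the length from the buffer instead of the
validated copy.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

typedef struct
{
	uint32_t	xl_tot_len;		/* stand-in for the record header field */
	/* other header fields elided */
} FakeRecordHeader;

static char fake_page[8192];	/* stand-in for the reader's page buffer */

/*
 * Simulates ReadPageInternal() being called a second time: the buffer
 * may be refilled from a source that has since been overwritten.
 */
static void
read_page_again(uint32_t new_len_on_disk)
{
	FakeRecordHeader *hdr = (FakeRecordHeader *) fake_page;

	hdr->xl_tot_len = new_len_on_disk;
}

int
main(void)
{
	FakeRecordHeader *record = (FakeRecordHeader *) fake_page;
	uint32_t	total_len;

	/* 1. First read: header looks sane, length is validated. */
	record->xl_tot_len = 100;
	total_len = record->xl_tot_len;
	if (total_len <= sizeof(fake_page))
		printf("validated length: %" PRIu32 "\n", total_len);

	/* 2. Page is (re)read; in the failure scenario the bytes differ. */
	read_page_again(4294636456u);

	/*
	 * 3. CRC step loads the length from the buffer *again*, not the
	 * validated copy -- this is the value the CRC would run over.
	 */
	printf("length used for CRC: %" PRIu32 "\n", record->xl_tot_len);
	return 0;
}

If that's the right reading of the code, the second value is the one
that ends up as the len argument to the CRC routine, matching the
absurd length seen in the grison backtrace.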

If that value didn't change underneath us, I think we'd need an
explanation for how we finished up in the single-page branch at
xlogreader.c:842 with a large xl_tot_len, which I'm not seeing yet,
though it might take more coffee. (Possibly supporting the re-read
theory is the fact that it's only happening on a few very slow
computers, though I have no idea why it would only happen on master
[so far at least].)
