From: | Amul Sul <sulamul(at)gmail(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Andres Freund <andres(at)anarazel(dot)de>, Andrew Dunstan <andrew(at)dunslane(dot)net>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, "masao(dot)fujii(at)oss(dot)nttdata(dot)com" <masao(dot)fujii(at)oss(dot)nttdata(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "mengjuan(dot)cmj(at)alibaba-inc(dot)com" <mengjuan(dot)cmj(at)alibaba-inc(dot)com>, "Jakub(dot)Wartak(at)tomtom(dot)com" <Jakub(dot)Wartak(at)tomtom(dot)com>, Ryo Matsumura <matsumura(dot)ryo(at)fujitsu(dot)com> |
Subject: | Re: prevent immature WAL streaming |
Date: | 2021-11-25 06:08:42 |
Message-ID: | CAAJ_b97KyJ6X9uO8KH31zn1vrcNscmHFUeE8+AFAzPqQPAmszw@mail.gmail.com |
Lists: | pgsql-hackers |
On Wed, Nov 24, 2021 at 2:10 AM Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> wrote:
>
> On 2021-Nov-23, Tom Lane wrote:
>
> > We're *still* not out of the woods with 026_overwrite_contrecord.pl,
> > as we are continuing to see occasional "mismatching overwritten LSN"
> > failures, further down in the test where it tries to start up the
> > standby:
>
> Augh.
>
> > Looking at adjacent successful runs, it seems that the exact point
> > where the "missing contrecord" starts varies substantially, even after
> > our previous fix to disable autovacuum in this test. How could that be?
>
> Well, there is intentionally some variability. Maybe not as much as one
> would wish, but I expect that that should explain why that point is not
> always the same.
>
> > It's probably for the best though, because I think this is exposing
> > an actual bug that we would not have seen if the start point were
> > completely consistent. I have not dug into the code, but it looks to
> > me like if the "consistent recovery state" is reached exactly at a
> > page boundary (0/1FFE000 in all these cases), then the standby expects
> > that to be what the OVERWRITE_CONTRECORD record will point at. But
> > actually it points to the first WAL record on that page, resulting
> > in a bogus failure.
>
> So what is happening is that we set state->overwrittenRecPtr to the LSN
> of page start, ignoring the page header. Is that the LSN of the first
> record in a page? I'll see if I can reproduce the problem.
>
In XLogReadRecord(), the two variables being compared are assigned
inconsistently: one is set from state->currRecPtr, while the other is
set from RecPtr.
.....
state->overwrittenRecPtr = state->currRecPtr;
.....
state->abortedRecPtr = RecPtr;
.....
Before the point where the assembled flag is set, there is a fair
amount of code that adjusts RecPtr. I think the latter assignment
should use state->currRecPtr instead of RecPtr as well.
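
To make the discrepancy concrete, here is a toy sketch (not the PostgreSQL source). It assumes the adjustment in question is the step that skips over the page header when a read starts exactly at a page boundary. The Toy* names, the 8192-byte page size, and the 24-byte header size are assumptions chosen for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only; every name and constant here is hypothetical. */
typedef uint64_t XLogRecPtr;

#define TOY_XLOG_BLCKSZ   8192u  /* assumed WAL page size */
#define TOY_PAGE_HDR_SIZE 24u    /* assumed short page header size */

typedef struct ToyReaderState
{
    XLogRecPtr currRecPtr;     /* where this read attempt started */
    XLogRecPtr abortedRecPtr;  /* remembered when a contrecord is missing */
} ToyReaderState;

/* Assumed adjustment: a pointer at a page start is moved past the header. */
static XLogRecPtr
toy_skip_page_header(XLogRecPtr RecPtr)
{
    if (RecPtr % TOY_XLOG_BLCKSZ == 0)
        RecPtr += TOY_PAGE_HDR_SIZE;
    return RecPtr;
}

/*
 * Simulate a failed read starting at 'start'.
 * use_curr == 0: remember the adjusted local RecPtr (current behaviour).
 * use_curr != 0: remember state->currRecPtr (the proposed change).
 */
static void
toy_record_abort(ToyReaderState *state, XLogRecPtr start, int use_curr)
{
    XLogRecPtr RecPtr = start;

    assert(state != NULL);
    state->currRecPtr = RecPtr;            /* saved before any adjustment */
    RecPtr = toy_skip_page_header(RecPtr); /* local copy may move forward */
    state->abortedRecPtr = use_curr ? state->currRecPtr : RecPtr;
}
```

In this model, starting at the page boundary 0/1FFE000 with use_curr == 0 leaves abortedRecPtr one header size past currRecPtr, while use_curr != 0 keeps the two equal. A start that is not on a page boundary gives the same value either way, which would fit the failure being only occasional.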
Regards,
Amul