Re: "invalid contrecord" error on replica

From: Adrien Nayrat <adrien(dot)nayrat(at)anayrat(dot)info>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: Francois(dot)JOULAUD(at)radiofrance(dot)com, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: "invalid contrecord" error on replica
Date: 2021-05-06 19:21:18
Message-ID: d3374925-79dc-fd0d-be9f-47fb4f967804@anayrat.info
Lists: pgsql-general

On 5/6/21 7:37 AM, Kyotaro Horiguchi wrote:
> At Sun, 2 May 2021 22:43:44 +0200, Adrien Nayrat <adrien(dot)nayrat(at)anayrat(dot)info> wrote in
>> I also dumped 00000001000000AA000000A1 on the secondary and it
>> contains all the records until AA/A1004018.
>>
>> It is really weird; I don't understand how the secondary can miss the
>> last 2 records of A0. It seems it did not receive the
>> CHECKPOINT_SHUTDOWN record?
>>
>> Any idea?
>
> This looks like the same issue as [1]. In short, the secondary
> received an incomplete record, but the primary forgot about that
> record after restart.
>
> Specifically, the primary was writing a WAL record that starts at
> A0FFFB70 and continues into the A1xxxxxx segment. The secondary
> successfully received the first half of the record, but the primary
> failed to write (and then send) the last half because the disk was full.
>
> At that point the primary's last completed record apparently ended
> at A0FFFB70. Then, during restart, the CHECKPOINT_SHUTDOWN record
> overwrote the already half-sent record, up to A0FFFBE8.
>
> On the secondary side, there is only the first half of the record,
> which the primary had forgotten, and the last half, starting at
> LSN A1000000, was still in the future of the primary's new history.
>
> After some time the primary reaches A1000000, but the first record in
> that segment of course disagrees with the history of the secondary.
>
> 1: https://www.postgresql.org/message-id/CBDDFA01-6E40-46BB-9F98-9340F4379505%40amazon.com
>
> regards.
>

Hello,

Thanks for your reply and your explanation! Now I understand, and it's good to
know this is a known issue.
I'll follow that thread; I hope we will find a solution. It's annoying that your
secondary breaks when your primary crashes, and that the only fix is either to
fetch an archived WAL file and replace it on the secondary, or to completely
rebuild your secondary.
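
For anyone reading this later, here is a rough sketch of the LSN arithmetic
behind the explanation above. It assumes the default 16 MB WAL segment size and
timeline 1, and it assumes the high part of the LSNs is AA (taken from the file
name 00000001000000AA000000A1). The helper names and the 2000-byte record length
are made up for illustration, and the calculation ignores WAL page headers, so
treat it as an approximation rather than what the server does internally.

```python
# Assumption: default WAL segment size of 16 MB (wal_segment_size).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def parse_lsn(text):
    """Convert an LSN written as 'hi/lo' in hex into a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_file_name(timeline, lsn, segment_size=WAL_SEGMENT_SIZE):
    """Name of the WAL segment file that contains the given LSN."""
    segments_per_xlogid = 0x100000000 // segment_size
    segno = lsn // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segments_per_xlogid,
        segno % segments_per_xlogid,
    )


def bytes_left_in_segment(lsn, segment_size=WAL_SEGMENT_SIZE):
    """Bytes between the given LSN and the end of its segment."""
    return segment_size - (lsn % segment_size)


def crosses_segment_boundary(start_lsn, length, segment_size=WAL_SEGMENT_SIZE):
    """True if a record of `length` bytes starting at start_lsn spills into
    the next segment (simplified: page headers are not counted)."""
    return length > bytes_left_in_segment(start_lsn, segment_size)


if __name__ == "__main__":
    start = parse_lsn("AA/A0FFFB70")       # start of the half-sent record
    print(wal_file_name(1, start))         # segment holding the first half
    print(wal_file_name(1, parse_lsn("AA/A1004018")))
    print(bytes_left_in_segment(start))    # room left before segment A1
    # Hypothetical record length, only to show the boundary check.
    print(crosses_segment_boundary(start, 2000))
```

With these assumptions, a record starting at AA/A0FFFB70 has only 1168 bytes
left in segment A0, so anything longer continues into
00000001000000AA000000A1, which is the part the secondary never got in a form
consistent with the primary's new history.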

Thanks

--
Adrien NAYRAT
