RE: BUG #16922: In cascading replication, a standby server aborted when an upstream standby server promoted

From: "egashira(dot)yusuke(at)fujitsu(dot)com" <egashira(dot)yusuke(at)fujitsu(dot)com>
To: 'Kyotaro Horiguchi' <horikyota(dot)ntt(at)gmail(dot)com>
Cc: "bossartn(at)amazon(dot)com" <bossartn(at)amazon(dot)com>, "x4mmm(at)yandex-team(dot)ru" <x4mmm(at)yandex-team(dot)ru>, "hlinnaka(at)iki(dot)fi" <hlinnaka(at)iki(dot)fi>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: RE: BUG #16922: In cascading replication, a standby server aborted when an upstream standby server promoted
Date: 2021-04-02 11:09:21
Message-ID: OSBPR01MB5029194FC5CAA1400DF842C9FF7A9@OSBPR01MB5029.jpnprd01.prod.outlook.com
Lists: pgsql-bugs

Thanks for replying.

> This is the same issue with what is discussed in [1] and recently
> reported by [2].
>
> [1] https://www.postgresql.org/message-id/E63E5670-6CC3-4B09-9686-A77CF94FE4A8%40amazon.com
>
> [2] https://www.postgresql.org/message-id/3f9c466d-d143-472c-a961-66406172af96.mengjuan.cmj@alibaba-inc.com

By reading the suggested threads, I understood that the cause of this problem is that the standby instance was promoted while a WAL record crossing a segment boundary had been only partially transferred.
So a full WAL storage on the primary server was not necessarily a factor.
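The boundary condition described above can be sketched in a few lines (a minimal illustration with hypothetical helper names, assuming the default 16 MB wal_segment_size): a record whose start LSN plus length passes a segment boundary is split across two segments, so a promotion can happen with only the first half received.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # default wal_segment_size

def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN string like '0/6000000' to a byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def crosses_segment_boundary(start_lsn: int, record_len: int) -> bool:
    """True if a WAL record starting at start_lsn spills into the next segment."""
    return start_lsn // WAL_SEGMENT_SIZE != (start_lsn + record_len - 1) // WAL_SEGMENT_SIZE

# The apply position in the log below, 0/5FFFFE0, is 32 bytes short of the
# segment boundary 0/6000000, so any record longer than 32 bytes starting
# there crosses into the next segment.
print(crosses_segment_boundary(parse_lsn("0/5FFFFE0"), 64))  # True
```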

Just the other day, a similar crash occurred at the same customer when an intermediate standby in a cascading replication setup was promoted, even though the primary's WAL storage was not full.
I think it is the same issue, because the WAL position logged at the point of the crash was exactly a segment boundary.

I applied the patches from the suggested threads [1] and [2] to master (055fee7eb4dcc78e58672aef146334275e1cc40d) and tested them.

However, with patch [1] alone, the abort still occurred when I ran my reproduction procedure.

On the other hand, the patch from thread [2] does seem to solve this issue.
With patch [2] applied, the cascade standby server no longer aborted.
The logfile also shows that the abort was prevented under the same conditions.
I attach the logfile of NODE-C (debug5).

2021-04-02 19:20:27.430 JST [110914] DEBUG: sending write 0/6000000 flush 0/6000000 apply 0/5FFFFE0 (reply requested)
2021-04-02 19:20:27.430 JST [110914] DEBUG: sendtime 2021-04-02 19:20:27.430755+09 receipttime 2021-04-02 19:20:27.430776+09 replication apply delay 91585 ms transfer latency 1 ms
2021-04-02 19:20:35.932 JST [110914] LOG: replication terminated by primary server
2021-04-02 19:20:35.932 JST [110914] DETAIL: End of WAL reached on timeline 1 at 0/6000000.
2021-04-02 19:20:35.932 JST [110914] LOG: fetching timeline history file for timeline 2 from primary server
2021-04-02 19:20:35.943 JST [110914] DEBUG: walreceiver ended streaming and awaits new instructions
2021-04-02 19:20:35.943 JST [110910] LOG: new target timeline is 2
2021-04-02 19:20:35.943 JST [110910] DEBUG: switched WAL source from stream to archive after failure
2021-04-02 19:20:35.943 JST [110910] DEBUG: could not open file "pg_wal/000000020000000000000006": No such file or directory
2021-04-02 19:20:35.943 JST [110910] DEBUG: switched WAL source from archive to stream after failure
2021-04-02 19:20:35.943 JST [110914] LOG: restarted WAL streaming at 0/6000000 on timeline 2
2021-04-02 19:20:35.943 JST [110914] DEBUG: sendtime 2021-04-02 19:20:35.943432+09 receipttime 2021-04-02 19:20:35.94346+09 replication apply delay 100098 ms transfer latency 1 ms
2021-04-02 19:20:35.943 JST [110914] DEBUG: creating and filling new WAL file
2021-04-02 19:20:36.032 JST [110914] DEBUG: done creating and filling new WAL file
2021-04-02 19:20:36.032 JST [110914] DEBUG: sending write 0/6000038 flush 0/5FFFFE0 apply 0/5FFFFE0
2021-04-02 19:20:36.033 JST [110914] DEBUG: sending write 0/6000038 flush 0/6000038 apply 0/5FFFFE0

=> After the timeline change, the old segment file was re-read, but this no longer caused the abort.
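To follow the walreceiver status lines above mechanically, here is a small parser sketch (the regex shape is an assumption based only on the quoted log): it extracts the write/flush/apply positions, which makes it easy to see that flush briefly trails write right after the timeline switch and then catches up.

```python
import re

# Matches the "sending write … flush … apply …" DEBUG lines quoted above.
STATUS_RE = re.compile(r"sending write (\S+) flush (\S+) apply (\S+)")

def parse_lsn(lsn: str) -> int:
    """Convert an LSN string like '0/6000038' to a byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def parse_status(line: str):
    """Return (write, flush, apply) byte positions, or None if no match."""
    m = STATUS_RE.search(line)
    if m is None:
        return None
    return tuple(parse_lsn(x) for x in m.groups())

lines = [
    "DEBUG: sending write 0/6000038 flush 0/5FFFFE0 apply 0/5FFFFE0",
    "DEBUG: sending write 0/6000038 flush 0/6000038 apply 0/5FFFFE0",
]
for write, flush, apply_pos in map(parse_status, lines):
    # In a healthy stream, write >= flush >= apply always holds.
    print(write >= flush >= apply_pos)
```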

The patch in thread [2] appears to be required to fix this issue. I hope it will be committed.

Regards.
Yusuke, Egashira.
