Re: [Bug Fix]standby may crash when switching-over in certain special cases

From: Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>
To: px shi <spxlyy123(at)gmail(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: [Bug Fix]standby may crash when switching-over in certain special cases
Date: 2024-08-20 16:49:20
Message-ID: 20240821014920.e895c9f27217ad3d533ffa0b@sraoss.co.jp
Lists: pgsql-hackers

On Mon, 19 Aug 2024 16:43:09 +0800
px shi <spxlyy123(at)gmail(dot)com> wrote:

> Hi, hackers,
>
> I've recently encountered an issue where a standby crashes when
> reconnecting to a new primary after a switchover under certain conditions.
> Here's the sequence of events in the crash scenario:
>
>
> 1) We have three instances: one primary and two standbys (s1 and s2, both
> using streaming replication).
>
>
> 2) The primary crashed when the standby’s `flushed_lsn` was at the
> beginning of a WAL segment (e.g., `0/22000000`). Both s1 and s2 logged the
> following:
>
> ```
>
> FATAL: could not connect to the primary server...
>
> LOG: waiting for WAL to become available at 0/220000ED
>
> ```
>
>
> 3) s1 was promoted to the new primary and logged the following:
>
> ```
>
> LOG: received promote request
>
> LOG: redo done at 0/21FFFEE8
>
> LOG: selected new timeline ID: 2
>
> ```
>
>
> 4) s2's `primary_conninfo` was updated to point to s1; s2 logged the
> following:
>
> ```
>
> LOG: received SIGHUP, reloading configuration files
>
> LOG: parameter "primary_conninfo" changed to ...
>
> ```
>
>
> 5) s2 began replication with s1 and attempted to fetch `0/22000000` on
> timeline 2; s2 logged the following:
>
> ```
>
> LOG: fetching timeline history file for timeline 2 from primary server
>
> FATAL: could not start WAL streaming: ERROR: requested starting point
> 0/22000000 on timeline 1 is not this server's history
>
> DETAIL: This server's history forked from timeline 1 at 0/21FFFFE0.
>
> LOG: started streaming WAL from primary at 0/22000000 on timeline 2
>
> ```
>
>
> 6) A WAL record mismatch caused the walreceiver process to terminate; s2
> logged the following:
>
> ```
>
> LOG: invalid contrecord length 10 (expected 213) at 0/21FFFFE0
>
> FATAL: terminating walreceiver process due to administrator command
>
> ```
>
>
> 7) s2 then attempted to fetch `0/21000000` on timeline 2. However, the
> startup process tried to open the WAL file before the walreceiver had
> created it, leading to a crash:
>
> ```
>
> PANIC: could not open file "pg_wal/000000020000000000000021": No such
> file or directory
>
> LOG: startup process was terminated by signal 6: Aborted
>
> ```
>
>
> In this scenario, s2 attempts replication twice. First, it starts from
> `0/22000000` on timeline 2, setting `walrcv->flushedUpto` to `0/22000000`.
> But when a mismatch occurs, the walreceiver process is terminated. On the
> second attempt, replication starts from `0/21000000` on timeline 2. The
> startup process expects the WAL file to exist because WalRcvFlushRecPtr()
> returns `0/22000000`, but the file is not found, causing s2's startup
> process to crash.
>
> I think recptr should be checked against walrcv->flushedUpto in
> RequestXLogStreaming(), so I have created a patch for it.
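>
> For illustration, a minimal sketch of the idea (based on the existing
> `WalRcvData` fields used by RequestXLogStreaming() in
> src/backend/replication/walreceiverfuncs.c; the actual patch may differ):
>
> ```
> /*
>  * Reset flushedUpto not only on the first walreceiver start or on a
>  * timeline change, but also when streaming is requested from a point
>  * before the previously flushed position, so that WalRcvFlushRecPtr()
>  * cannot report WAL that was never written on the new timeline.
>  */
> if (walrcv->receiveStart == 0 ||
>     walrcv->receivedTLI != tli ||
>     recptr < walrcv->flushedUpto)
> {
>     walrcv->flushedUpto = recptr;
>     walrcv->latestChunkStart = recptr;
>     walrcv->receivedTLI = tli;
> }
> walrcv->receiveStart = recptr;
> walrcv->receiveStartTLI = tli;
> ```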

Is s1 a cascading standby of s2? If, instead, s1 and s2 are both standbys of
the primary server, it is not surprising that s2 has progressed further than
s1 when the primary fails. I believe this is a case where you should use
pg_rewind. Even if flushedUpto is reset as proposed in your patch, s2 might
already have applied a WAL record that s1 has not processed yet, and there
would be no guarantee that subsequent applies succeed.
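
For example, after shutting down s2 cleanly, something like the following
would resynchronize it with the promoted s1 (paths and connection parameters
here are placeholders):

```
pg_rewind --target-pgdata=/path/to/s2/data \
          --source-server="host=s1-host port=5432 user=postgres" \
          --write-recovery-conf --progress
```

Note that pg_rewind requires the target cluster to have been initialized with
data checksums or to have wal_log_hints enabled.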

Regards,
Yugo Nagata

--
Yugo Nagata <nagata(at)sraoss(dot)co(dot)jp>
