[Bug Fix]standby may crash when switching-over in certain special cases

From: px shi <spxlyy123(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: [Bug Fix]standby may crash when switching-over in certain special cases
Date: 2024-08-19 08:43:09
Message-ID: CAAccyYKrRojjO-weeXFs1EqLFHWSBfjBzObwVDv4u-ZxmU=7Rg@mail.gmail.com
Lists: pgsql-hackers

Hi, hackers,

I've recently encountered an issue where a standby crashes when
reconnecting to a new primary after a switchover under certain conditions.
Here is the sequence of events that leads to the crash:

1) We have three instances: one primary and two standbys (s1 and s2, both
using streaming replication).

2) The primary crashed when the standbys’ `flushed_lsn` was at the
beginning of a WAL segment (e.g., `0/22000000`). Both s1 and s2 logged the
following:

```
FATAL: could not connect to the primary server...
LOG: waiting for WAL to become available at 0/220000ED
```

3) s1 was promoted to the new primary; s1 logged the following:

```
LOG: received promote request
LOG: redo done at 0/21FFFEE8
LOG: selected new timeline ID: 2
```

4) s2's `primary_conninfo` was updated to point to s1; s2 logged the
following:

```
LOG: received SIGHUP, reloading configuration files
LOG: parameter "primary_conninfo" changed to ...
```

5) s2 began replicating from s1 and attempted to fetch `0/22000000` on
timeline 2; s2 logged the following:

```
LOG: fetching timeline history file for timeline 2 from primary server
FATAL: could not start WAL streaming: ERROR: requested starting point 0/22000000 on timeline 1 is not in this server's history
DETAIL: This server's history forked from timeline 1 at 0/21FFFFE0.
LOG: started streaming WAL from primary at 0/22000000 on timeline 2
```

6) A WAL record mismatch caused the walreceiver process to terminate; s2
logged the following:

```
LOG: invalid contrecord length 10 (expected 213) at 0/21FFFFE0
FATAL: terminating walreceiver process due to administrator command
```

7) s2 then attempted to fetch `0/21000000` on timeline 2. However, the
startup process tried to open the corresponding WAL file before it had
been created, leading to a crash:

```
PANIC: could not open file "pg_wal/000000020000000000000021": No such file or directory
LOG: startup process was terminated by signal 6: Aborted
```

In this scenario, s2 attempts replication twice. On the first attempt it
starts streaming from `0/22000000` on timeline 2, which sets
`walrcv->flushedUpto` to `0/22000000`; when the record mismatch is
detected, the walreceiver process is terminated. On the second attempt,
replication starts from `0/21000000` on timeline 2. The startup process
expects the corresponding WAL file to already exist locally because
GetWalRcvFlushRecPtr() still returns `0/22000000`, but the file is not
found, so s2's startup process crashes.
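
The reason `flushedUpto` keeps its stale value is that
RequestXLogStreaming() only resets it on the very first start of
walreceiver or on a timeline change. Quoting the relevant part roughly
from memory (src/backend/replication/walreceiverfuncs.c; details may
differ slightly between versions): the second attempt in step 7 runs on
the same timeline 2, so neither condition holds and `flushedUpto` is left
at `0/22000000`.

```
	/*
	 * If this is the first startup of walreceiver (on this timeline),
	 * initialize flushedUpto and latestChunkStart to the starting point.
	 */
	if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
	{
		walrcv->flushedUpto = recptr;
		walrcv->receivedTLI = tli;
		walrcv->latestChunkStart = recptr;
	}
	walrcv->receiveStart = recptr;
	walrcv->receiveStartTLI = tli;
```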

I think recptr should be checked against walrcv->flushedUpto in
RequestXLogStreaming(), so I have created a patch for it.
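
For illustration only (the attached patch is the authoritative change),
the idea is roughly to also reset `flushedUpto` when streaming is
requested from a point older than what walreceiver previously reported as
flushed:

```
	/*
	 * Illustrative sketch, not the attached patch: also reset flushedUpto
	 * when streaming is requested from a point older than the last reported
	 * flush position, so the startup process does not expect WAL segments
	 * that were never actually received.
	 */
	if (walrcv->receiveStart == 0 ||
		walrcv->receivedTLI != tli ||
		recptr < walrcv->flushedUpto)
	{
		walrcv->flushedUpto = recptr;
		walrcv->receivedTLI = tli;
		walrcv->latestChunkStart = recptr;
	}
```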

Best regards,
pixian shi

Attachment Content-Type Size
0001-Fix-walreceiver-set-incorrect-flushedUpto-when-switc.patch application/octet-stream 1.3 KB
