From: px shi <spxlyy123(at)gmail(dot)com>
To: pgsql-hackers(at)postgresql(dot)org
Subject: [Bug Fix] standby may crash when switching-over in certain special cases
Date: 2024-08-19 08:43:09
Message-ID: CAAccyYKrRojjO-weeXFs1EqLFHWSBfjBzObwVDv4u-ZxmU=7Rg@mail.gmail.com
Lists: pgsql-hackers
Hi, hackers,
I've recently encountered an issue where a standby crashes when
reconnecting to a new primary after a switchover under certain conditions.
Here is the sequence of events leading to the crash:
1) We have three instances: one primary and two standbys (s1 and s2, both
using streaming replication).
2) The primary crashed while the standbys' `flushed_lsn` was at the
beginning of a WAL segment (e.g., `0/22000000`). Both s1 and s2 logged the
following:
```
FATAL: could not connect to the primary server...
LOG: waiting for WAL to become available at 0/220000ED
```
3) s1 was promoted to be the new primary and logged the following:
```
LOG: received promote request
LOG: redo done at 0/21FFFEE8
LOG: selected new timeline ID: 2
```
4) s2's `primary_conninfo` was updated to point to s1; s2 logged the
following:
```
LOG: received SIGHUP, reloading configuration files
LOG: parameter "primary_conninfo" changed to ...
```
5) s2 began replicating from s1 and attempted to fetch `0/22000000` on
timeline 2; s2 logged the following:
```
LOG: fetching timeline history file for timeline 2 from primary server
FATAL: could not start WAL streaming: ERROR: requested starting point
0/22000000 on timeline 1 is not this server's history
DETAIL: This server's history forked from timeline 1 at 0/21FFFFE0.
LOG: started streaming WAL from primary at 0/22000000 on timeline 2
```
6) A WAL record mismatch caused the walreceiver process to be terminated;
s2 logged the following:
```
LOG: invalid contrecord length 10 (expected 213) at 0/21FFFFE0
FATAL: terminating walreceiver process due to administrator command
```
7) s2 then attempted to fetch `0/21000000` on timeline 2. However, the
startup process tried to open the WAL segment file before the walreceiver
had created it, leading to a crash:
```
PANIC: could not open file "pg_wal/000000020000000000000021": No such
file or directory
LOG: startup process was terminated by signal 6: Aborted
```
In this scenario, s2 attempts replication twice. On the first attempt, it
starts streaming from `0/22000000` on timeline 2, which sets
`walrcv->flushedUpto` to `0/22000000`. When the record mismatch occurs, the
walreceiver process is terminated. On the second attempt, streaming starts
from `0/21000000` on timeline 2, but `walrcv->flushedUpto` is not reset.
The startup process therefore expects the requested WAL segment to exist,
because WalRcvFlushRecPtr() still returns `0/22000000`; since the file was
never received, s2's startup process crashes.
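To make the state handling concrete, here is a small self-contained toy
model of the `walrcv->flushedUpto` bookkeeping. This is not PostgreSQL
code; the struct and function names only mirror WalRcvData and
RequestXLogStreaming() for readability. It shows how, after the second
request, `flushedUpto` stays ahead of anything that was actually streamed:
```
#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

typedef uint64_t XLogRecPtr;
typedef uint32_t TimeLineID;

/* Minimal stand-in for the fields of WalRcvData that matter here. */
typedef struct
{
	XLogRecPtr	receiveStart;	/* requested streaming start position */
	XLogRecPtr	flushedUpto;	/* WAL position believed flushed to disk */
	TimeLineID	receivedTLI;	/* timeline of the received WAL */
} WalRcvModel;

/*
 * Mimics the bookkeeping in RequestXLogStreaming(): flushedUpto is only
 * (re)initialized on the very first start or when the timeline changes.
 */
static void
request_streaming(WalRcvModel *walrcv, TimeLineID tli, XLogRecPtr recptr)
{
	if (walrcv->receiveStart == 0 || walrcv->receivedTLI != tli)
	{
		walrcv->flushedUpto = recptr;
		walrcv->receivedTLI = tli;
	}
	walrcv->receiveStart = recptr;
}

int
main(void)
{
	WalRcvModel walrcv = {0, 0, 0};

	/* Step 5: first attempt, start at 0/22000000 on timeline 2. */
	request_streaming(&walrcv, 2, 0x22000000);

	/*
	 * Step 7: after the contrecord mismatch, restart at 0/21000000 on the
	 * same timeline.  flushedUpto is NOT reset because the TLI is unchanged.
	 */
	request_streaming(&walrcv, 2, 0x21000000);

	/*
	 * The startup process trusts flushedUpto (via WalRcvFlushRecPtr()) and
	 * expects WAL up to 0/22000000 to be on disk, so it opens a segment
	 * that was never streamed -- hence the PANIC in step 7.
	 */
	printf("receiveStart = %" PRIX64 ", flushedUpto = %" PRIX64 "\n",
		   walrcv.receiveStart, walrcv.flushedUpto);
	return 0;
}
```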
I think recptr should be checked against walrcv->flushedUpto when
requesting XLOG streaming (RequestXLogStreaming()), so I've created a patch
for it (attached).
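To illustrate the idea only (the attached patch is the authoritative
change; the condition below is a paraphrased sketch reusing the toy types
above, not the actual PostgreSQL code), the fix amounts to also resetting
flushedUpto when streaming is requested from a point older than the
recorded flush position:
```
/*
 * Sketch only -- see the attached patch for the real change against
 * RequestXLogStreaming().  The extra check resets flushedUpto whenever the
 * requested start point is behind it, so WalRcvFlushRecPtr() can no longer
 * report WAL that was never streamed on this connection.
 */
static void
request_streaming_fixed(WalRcvModel *walrcv, TimeLineID tli, XLogRecPtr recptr)
{
	if (walrcv->receiveStart == 0 ||
		walrcv->receivedTLI != tli ||
		walrcv->flushedUpto > recptr)	/* new: start point is behind flushedUpto */
	{
		walrcv->flushedUpto = recptr;
		walrcv->receivedTLI = tli;
	}
	walrcv->receiveStart = recptr;
}
```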
Best regards,
pixian shi
Attachment: 0001-Fix-walreceiver-set-incorrect-flushedUpto-when-switc.patch (application/octet-stream, 1.3 KB)