Streaming replication: PANIC on tertiary when secondary promoted

From: Alexey Bashtanov <bashtanov(at)imap(dot)cc>
To: PostgreSQL General <pgsql-general(at)postgresql(dot)org>
Subject: Streaming replication: PANIC on tertiary when secondary promoted
Date: 2021-06-16 15:27:35
Message-ID: 887ed598-ccc4-4a82-0be9-f5065a026f2c@imap.cc
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi,

I had a cascade serverA->serverB->serverC->serverD of Postgres 10.14
servers connected with streaming replication.
There was no archive shipping set up, but there was an empty directory
/data/pg_archive/10/dedupe_shard1_10/ mentioned in config for it on each
of the servers.

When I promoted serverB, serverC crashed:

Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [7-1]
2021-06-16 14:45:43.717 UTC [43934] {-} LOG:  replication terminated by
primary server
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [7-2]
2021-06-16 14:45:43.717 UTC [43934] {-} DETAIL:  End of WAL reached on
timeline 1 at 190B0/36000000.
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [8-1]
2021-06-16 14:45:43.717 UTC [43934] {-} LOG:  fetching timeline history
file for timeline 2 from primary server
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002.history':
No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history':
No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002.history':
No such file or directory
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[33222]: [9-1]
2021-06-16 14:45:43.736 UTC [33222] {-} LOG:  new target timeline is 2
cp: cannot stat
'/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such
file or directory
cp: cannot stat
'/data/pg_archive/10/dedupe_shard1_10/00000001000190B000000036': No such
file or directory
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [9-1]
2021-06-16 14:45:43.746 UTC [43934] {-} LOG:  restarted WAL streaming at
190B0/36000000 on timeline 2
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[43934]: [10-1]
2021-06-16 14:45:43.746 UTC [43934] {-} FATAL:  could not receive data
from WAL stream: ERROR:  requested starting point 190B0/36000000 is
ahead of the WAL flush position of this server 190B0/35FFFFE8
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history':
No such file or directory
cp: cannot stat
'/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such
file or directory
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[93365]: [7-1]
2021-06-16 14:45:43.764 UTC [93365] {-} LOG:  started streaming WAL from
primary at 190B0/36000000 on timeline 2
Jun 16 14:45:43 serverC postgresql-10-dedupe_shard1_10[93365]: [8-1]
2021-06-16 14:45:43.764 UTC [93365] {-} FATAL:  could not receive data
from WAL stream: ERROR:  requested starting point 190B0/36000000 is
ahead of the WAL flush position of this server 190B0/35FFFFE8
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history':
No such file or directory
cp: cannot stat
'/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such
file or directory
Jun 16 14:45:48 serverC postgresql-10-dedupe_shard1_10[93421]: [7-1]
2021-06-16 14:45:48.771 UTC [93421] {-} LOG:  started streaming WAL from
primary at 190B0/36000000 on timeline 2
Jun 16 14:45:48 serverC postgresql-10-dedupe_shard1_10[33222]: [10-1]
2021-06-16 14:45:48.792 UTC [33222] {-} LOG:  invalid contrecord length
1585 at 190B0/35FFFFB8
Jun 16 14:45:48 serverC postgresql-10-dedupe_shard1_10[93421]: [8-1]
2021-06-16 14:45:48.793 UTC [93421] {-} FATAL:  terminating walreceiver
process due to administrator command
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history':
No such file or directory
cp: cannot stat
'/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000035': No such
file or directory
Jun 16 14:45:48 serverC postgresql-10-dedupe_shard1_10[33222]: [11-1]
2021-06-16 14:45:48.803 UTC [33222] {-} PANIC:  could not open file
"pg_wal/00000002000190B000000035": No such file or directory

After I started it it seems to have caught up:

Jun 16 14:59:09 serverB postgresql-10-dedupe_shard1_10[100187]: [8-1]
2021-06-16 14:59:09.826 UTC [100187] {-} LOG:  consistent recovery state
reached at 190B0/35FFFFB8
Jun 16 14:59:09 serverB postgresql-10-dedupe_shard1_10[100172]: [5-1]
2021-06-16 14:59:09.827 UTC [100172] {-} LOG:  database system is ready
to accept read only connections
cp: cannot stat
'/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such
file or directory
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[100187]: [9-1]
2021-06-16 14:59:10.281 UTC [100187] {-} LOG:  invalid record length at
190B0/36000898: wanted 24, got 0
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[101331]: [6-1]
2021-06-16 14:59:10.292 UTC [101331] {-} LOG:  started streaming WAL
from primary at 190B0/36000000 on timeline 2
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[101332]: [6-1]
2021-06-16 14:59:10.332 UTC [101332] {[unknown]-[local]}
[unknown](at)[unknown] LOG:  incomplete startup packet
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[101334]: [6-1]
2021-06-16 14:59:10.508 UTC [101334] {walreceiver-10.1.10.12(43648)}
replication(at)[unknown] ERROR:  requested starting point 190B0/36000000 on
timeline 1 is not in this server's history
Jun 16 14:59:10 serverB postgresql-10-dedupe_shard1_10[101334]: [6-2]
2021-06-16 14:59:10.508 UTC [101334] {walreceiver-10.1.10.12(43648)}
replication(at)[unknown] DETAIL:  This server's history forked from
timeline 1 at 190B0/35FFFFB8.
Jun 16 14:59:11 serverB postgresql-10-dedupe_shard1_10[101336]: [6-1]
2021-06-16 14:59:11.509 UTC [101336] {walreceiver-10.1.10.12(43652)}
replication(at)[unknown] LOG:  could not receive data from client:
Connection reset by peer
Jun 16 14:59:11 serverB postgresql-10-dedupe_shard1_10[101336]: [7-1]
2021-06-16 14:59:11.509 UTC [101336] {walreceiver-10.1.10.12(43652)}
replication(at)[unknown] LOG:  unexpected EOF on standby connection

But now serverD crashed in a similar fashion:

Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29127]: [6-1]
2021-06-16 14:59:10.500 UTC [29127] {-} LOG:  fetching timeline history
file for timeline 2 from primary server
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29127]: [7-1]
2021-06-16 14:59:10.509 UTC [29127] {-} FATAL:  could not start WAL
streaming: ERROR:  requested starting point 190B0/36000000 on timeline 1 is
 not in this server's history
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29127]: [7-2]
#011DETAIL: This server's history forked from timeline 1 at 190B0/35FFFFB8.
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002.history':
No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history':
No such file or directory
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000002.history':
No such file or directory
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[32792]: [9-1]
2021-06-16 14:59:10.524 UTC [32792] {-} LOG:  new target timeline is 2
cp: cannot stat
'/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000036': No such
file or directory
cp: cannot stat
'/data/pg_archive/10/dedupe_shard1_10/00000001000190B000000036': No such
file or directory
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29138]: [6-1]
2021-06-16 14:59:10.556 UTC [29138] {-} LOG:  started streaming WAL from
primary at 190B0/36000000 on timeline 2
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[32792]: [10-1]
2021-06-16 14:59:10.588 UTC [32792] {-} LOG:  invalid contrecord length
1585 at 190B0/35FFFFB8
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29138]: [7-1]
2021-06-16 14:59:10.588 UTC [29138] {-} FATAL:  terminating walreceiver
process due to administrator command
cp: cannot stat '/data/pg_archive/10/dedupe_shard1_10/00000003.history':
No such file or directory
cp: cannot stat
'/data/pg_archive/10/dedupe_shard1_10/00000002000190B000000035': No such
file or directory
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[32792]: [11-1]
2021-06-16 14:59:10.598 UTC [32792] {-} PANIC:  could not open file
"pg_wal/00000002000190B000000035": No such file or directory
Jun 16 14:59:10 serverD postgresql-10-dedupe_shard1_10[29143]: [6-1]
2021-06-16 14:59:10.616 UTC [29143] {-} LOG:  started streaming WAL from
primary at 190B0/35000000 on timeline 2
Jun 16 14:59:11 serverD postgresql-10-dedupe_shard1_10[32790]: [6-1]
2021-06-16 14:59:11.505 UTC [32790] {-} LOG:  startup process (PID
32792) was terminated by signal 6: Aborted
Jun 16 14:59:11 serverD postgresql-10-dedupe_shard1_10[32790]: [7-1]
2021-06-16 14:59:11.506 UTC [32790] {-} LOG:  terminating any other
active server processes

and it came back again too after restart.

Is it all expected? Is the data safe and consistent now or should I
re-init serverC and serverD?

Best regards, Alex

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Vijaykumar Jain 2021-06-16 16:20:34 Re: Streaming replication: PANIC on tertiary when secondary promoted
Previous Message Tom Lane 2021-06-16 15:08:49 Re: A simple question about text fields