Re: How to recover correctly master and replica using backup made by pg_basebackup?

From: Evgeniy Ratkov <e(dot)ratkov(at)arenadata(dot)io>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: How to recover correctly master and replica using backup made by pg_basebackup?
Date: 2025-03-18 13:02:55
Message-ID: 54111742302195@mail.yandex.ru
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general


On 07/11/2024 22:19, Evgeniy Ratkov wrote:

Hello.

I try to use pg_basebackup to make backup and recover master and replica

from it.

I recover master with action "promote". Next, I recover replica from the

same

backup with action "shutdown". After it, I start replica with configured

connection

and replication slot to the master. I get error, because replica reads

and applies

all WAL of irrelevant timeline, which contains logs after restore point.

I wrote Perl script for pg_basebackup's TAP tests. It is attached as

file "050_check_replication.pl".

It works on PostgreSQL 17, commit

7d85d87f4d5c35fd5b2d38adaef63dfbfa542ccc after

applying small patch, which adds possibility to start node with shutdown

option.

Patch is attached as file "add_node_start_with_shutdown.diff".

Script generates situation when timeline changes several times. Small

schema may

explain it:

Timeline 1:

   Create the first backup

   Create restore point 'rp_repl_bug'

Timeline 1 -> 2:

   Recovery with promote using the first backup and 'rp_repl_bug'

   Create the second backup

   Create restore point 'rp_repl_bug2'

Timeline 2 -> 3:

   Recovery with promote using the second backup and 'rp_repl_bug2'

Timeline 1 -> 4:

   Recovery with promote using the first backup and 'rp_repl_bug'

After it I get "00000004.history" file, which looks strange:

1    0/3000090    at restore point "rp_repl_bug"

2    0/5000090    at restore point "rp_repl_bug2"

3    0/3000090    at restore point "rp_repl_bug"

Next, I try to restore replica from the first backup. Replica did not

find WAL for

timeline 4 and downloaded and applied WAL for timeline 2.

cp: cannot stat '${ARCHIVE_PATH}/000000040000000000000003': No such file

or directory

LOG:  redo starts at 0/2000028

cp: cannot stat '${ARCHIVE_PATH}/000000040000000000000003': No such file

or directory

'${ARCHIVE_PATH}/000000020000000000000003' ->

'${REPLICA_PGDATA}/pg_wal/RECOVERYXLOG'

LOG:  restored log file "000000020000000000000003" from archive

LOG:  completed backup recovery with redo LSN 0/2000028 and end LSN

0/2000120

LOG:  consistent recovery state reached at 0/2000120

LOG:  recovery stopping at restore point "rp_repl_bug", time 2024-11-05

22:00:16.037126+03

Than, at the next replica's start (with configured connection to the

master and

replication slot) replica can not start receiving WAL from master,

because it

applied all WAL from timeline 2 including all lines after LSN, where it

should

switch timeline:

LOG:  entering standby mode

LOG:  redo starts at 0/2000028

LOG:  consistent recovery state reached at 0/3000090

LOG:  database system is ready to accept read-only connections

LOG:  started streaming WAL from primary at 0/4000000 on timeline 4

FATAL:  could not receive data from WAL stream: ERROR:  requested

starting point 0/4000000 is ahead of the WAL flush position of this

server 0/3002000

If I restart replica, it will not start:

LOG:  database system was shut down in recovery at 2024-11-05 22:00:18 MSK

LOG:  entering standby mode

FATAL:  requested timeline 4 is not a child of this server's history

DETAIL:  Latest checkpoint is at 0/3000090 on timeline 2, but in the

history of the requested timeline, the server forked off from that

timeline at 0/5000090.

LOG:  startup process (PID 2384880) exited with exit code 1

Questions:

1) Why does mirror not switch to the next timeline when it get LSN in

WAL, which

there is in the history file? And why does mirror apply all WAL?

2) Why does 00000004.history contain info about switching from 2 and 3

timelines

even if the 4 timeline was switched from the first timeline?

Thank you in advance.

 


There is Heikki Linnakangas's patch:


0003-Don-t-read-past-current-TLI-during-archive-recovery.patch from


https://www.postgresql.org/message-id/fe7af87d-b9bf-481a-902c-089c49aa911d%40iki.fi.


It fixes my problem with recovery backup on standby.


 


I refactored the test, which may show the problem. The test passed


with Heikki Linnakangas's patch. Without patch it fails with error:


"FATAL:  could not receive data from WAL stream: ERROR:  requested


starting point 0/4000000 is ahead of the WAL flush position of this


server 0/3002000".


 


The test is attached as "050_check_recovery_backup.pl". I added


the test to "src/test/recovery/t" directory.


 


Thank you.


 


--


Evgeniy Ratkov

Attachment Content-Type Size
unknown_filename text/html 5.0 KB
050_check_recovery_backup.pl application/octet-stream 5.7 KB

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Evgeniy Ratkov 2025-03-18 13:27:48 Re: BUG #18575: Sometimes pg_rewind mistakenly assumes that nothing needs to be done.
Previous Message Greg Sabino Mullane 2025-03-18 12:49:22 Re: Cannot pg_dump_all anymore...