Re: Trouble using pg_rewind to undo standby promotion

From: Torsten Förtsch <tfoertsch123(at)gmail(dot)com>
To: Craig McIlwee <craigm(at)vt(dot)edu>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Trouble using pg_rewind to undo standby promotion
Date: 2024-11-07 09:46:49
Message-ID: CAKkG4_kt9zZNPwB2Qd0-jAtRn3m=JPWW_-hc8VSyhfZ+VjfKQw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Your point of divergence is in the middle of the 7718/000000BF file. So,
you should have 2 such files eventually, one on timeline 1 and the other on
timeline 2.

Are you archiving WAL on the promoted machine in a way that your
restore_command can find it? Check archive_command and archive_mode on the
promoted machine.

Also, do your archive/restore scripts work properly for history files?

On Wed, Nov 6, 2024 at 7:48 PM Craig McIlwee <craigm(at)vt(dot)edu> wrote:

> I have a primary -> standby 1 -> standby 2 setup with all servers running
> PG 13.8 (this effort is part of getting on to a newer version, but I think
> those details aren't relevant to this problem). The first standby uses
> streaming replication from the primary and the second standby is using a
> WAL archive with a restore_command. To make this standby chain work,
> standby 1 is set to archive_mode = always with a command that populates the
> WAL archive.
>
> I would like to be able to promote standby 2 (hereon referred to just as
> 'standby'), perform some writes, then rewind it back to the point before
> promotion so it can become a standby again. The documentation for
> pg_rewind says that this is supported and it seems like it should be
> straightforward, but I'm not having any luck getting this to work so I'm
> hoping someone can point out what I'm doing wrong. Here's what I did:
>
> First, observe that WAL is properly being applied from the archive. Note
> that we are currently on timeline 1.
>
> 2024-11-06 09:51:23.286 EST [5438] LOG: restored log file
> "0000000100007711000000F9" from archive
> 2024-11-06 09:51:23.434 EST [5438] LOG: restored log file
> "0000000100007711000000FA" from archive
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/0000000100007711000000FB.gz: No such file or
> directory
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/00000002.history.gz: No such file or directory
>
> Next, stop postgres, set wal_log_hints = on as required by pg_rewind, and
> restart postgres. I also make a copy of the data directory while the
> postgres is not running so I can repeat my test, which works fine on a
> small test database but won't be possible for the multi TB database that I
> will eventually be doing this on.
>
> Now promote the standby using "select pg_promote()" and see that it
> switches to a new timeline. You can also see that the last WAL applied
> from the archive is 7718/BF.
>
> 2024-11-06 12:10:10.831 EST [4336] LOG: restored log file
> "0000000100007718000000BD" from archive
> 2024-11-06 12:10:10.996 EST [4336] LOG: restored log file
> "0000000100007718000000BE" from archive
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/0000000100007718000000BF.gz: No such file or
> directory
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/00000002.history.gz: No such file or directory
> 2024-11-06 12:10:15.384 EST [4336] LOG: restored log file
> "0000000100007718000000BF" from archive
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/0000000100007718000000C0.gz: No such file or
> directory
> 2024-11-06 12:10:15.457 EST [4336] LOG: received promote request
> 2024-11-06 12:10:15.457 EST [4336] LOG: redo done at 7718/BFFFFF30
> 2024-11-06 12:10:15.457 EST [4336] LOG: last completed transaction was at
> log time 2024-11-06 12:10:22.627074-05
> 2024-11-06 12:10:15.593 EST [4336] LOG: restored log file
> "0000000100007718000000BF" from archive
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/00000002.history.gz: No such file or directory
> 2024-11-06 12:10:15.611 EST [4336] LOG: selected new timeline ID: 2
> 2024-11-06 12:10:15.640 EST [4336] LOG: archive recovery complete
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/00000001.history.gz: No such file or directory
> 2024-11-06 12:10:17.028 EST [4329] LOG: database system is ready to
> accept connections
>
> Next, insert a record into just to make some changes that I want to
> rollback later. (What I will eventually be doing is creating a publication
> to ship data to a newer version, but again, that's not what's important
> here.)
>
> Finally, shutdown postgres and attempt a rewind. The address used in the
> --source-server connection string is the address of the primary.
>
> 2024-11-06 12:11:11.139 EST [4329] LOG: received fast shutdown request
> 2024-11-06 12:11:11.143 EST [4329] LOG: aborting any active transactions
> 2024-11-06 12:11:11.144 EST [4329] LOG: background worker "logical
> replication launcher" (PID 5923) exited with exit code 1
> 2024-11-06 12:11:40.933 EST [4342] LOG: shutting down
> 2024-11-06 12:11:41.753 EST [4329] LOG: database system is shut down
>
> /usr/pgsql-13/bin/pg_rewind --target-pgdata=/data/pgsql/operational
> --source-server="host=x.x.x.x dbname=postgres user=xxx password=xxx"
> --dry-run --progress --restore-target-wal
>
> pg_rewind: connected to server
> pg_rewind: servers diverged at WAL location 7718/BFFFFFE8 on timeline 1
> /data/wal_archive/restore_operational.sh: line 2:
> /data/wal_archive/operational/0000000200007718000000BF.gz: No such file or
> directory
> pg_rewind: error: could not restore file "0000000200007718000000BF" from
> archive
> pg_rewind: fatal: could not find previous WAL record at 7718/BFFFFFE8
>
> pg_rewind shows the point of divergence as 7718/BF on timeline 1, but when
> it tries to replay WAL using the restore command it is trying to find WAL
> from timeline 2 rather than picking back up on timeline 1. I tried
> setting recovery_target_timeline on the target database to 'current' and
> '1' but that gave the same result. Searching the archives, [1] mentions the
> need to force a checkpoint after promotion which I tried even though the
> problem description isn't the same. [2] mentions a problem that looks more
> like the one I am facing but has no responses. At this point I don't know
> what to do next and hope someone can point me in the right direction.
>
> [1]
> https://www.postgresql.org/message-id/e7b16ddea93a92575cb6d143b6ef602cab22432e.camel%40cybertec.at
> [2]
> https://www.postgresql.org/message-id/CALp3DH1fLZmPvkOteAbUo4TOLZP-LstKOs6Gcw3Bm7acmJqk=w@mail.gmail.com
>
> Craig
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Craig McIlwee 2024-11-07 13:26:46 Re: Trouble using pg_rewind to undo standby promotion
Previous Message Don Seiler 2024-11-07 03:41:02 Re: Index Partition Size Double of its Table Partition?