Using pg_rewind for DR testing / missing wal files

From: Matthew Tice <mjtice(at)gmail(dot)com>
To: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Using pg_rewind for DR testing / missing wal files
Date: 2021-03-17 02:16:48
Message-ID: CA+taBv8N3Q2CR7UOQH3ci=MR0T4dkygM65JLG+SVf9MgGY8RHg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Hi all,

Currently I'm running pgsql 10.4 on a centos7 machine as a primary
database. I have a standby database in GCP with the same OS/pgsql version.

What I'm trying to do is promote the standby every morning around 0600 so
that applications can connect and teams can do some testing throughout the
day. When dev teams are done testing around 1800 I want to pg_rewind the
standby so it gets caught back up where I can then repeat this process the
next day.

Things work okay for the most part but one hitch I'm running into that I
can't wrap my head around is this:

1. I promoted the standby today at about 0623.
2. I left it running all day
3. At about 1950 I went to pg_rewind the standby using the following
command:
/usr/pgsql-10/bin/pg_rewind -D /var/lib/pgsql/10/data/
--source-server='postgresql://drusr:x(at)10(dot)20(dot)6(dot)10/postgres' -P

It immediately failed that it couldn't find a wal file:
connected to server
servers diverged at WAL location 22C3/99779318 on timeline 2
could not open file
"/var/lib/pgsql/10/data//pg_wal/00000002000022C300000098": No such file or
directory

Now, looking at the wal file in my archive log dir on the primary, the
timestamp of the file was about 0623.

I scp'd the wal to the standby and reran the pg_rewind command. This time
is couldn't find the file before:
connected to server
servers diverged at WAL location 22C3/99779318 on timeline 2
could not open file
"/var/lib/pgsql/10/data//pg_wal/00000002000022C300000097": No such file or
directory

I repeated this process for about 8 wal files where the last one
"/var/lib/pgsql/10/data//pg_wal/00000002000022C300000092" was timestamp at
0621 - two minutes before I promoted the standby.

Going through the mailing list archives I thought I read somewhere that
maybe I needed to checkpoint the standby in order to prevent this very
issue - but that doesn't seem to help.

My question is, how can I achieve the behavior I'd like where pg_rewind
"just works" (I don't have to copy wal files from the primary to the
standby).

I did find an email in the archives from 2018 with the subject "pg_rewind:
confused about recovery.conf/restore_command" which the author thought that
the restore_command may be the solution - but didn't actually work out.

Thanks,

Matt

Browse pgsql-admin by date

  From Date Subject
Next Message Yambu 2021-03-17 12:05:10 Encryption key storage
Previous Message soumik.bhattacharjee 2021-03-15 15:08:46 RE: Pseudocode / Pseudo-Types