| From: | Dylan Luong <Dylan(dot)Luong(at)unisa(dot)edu(dot)au> |
|---|---|
| To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
| Cc: | "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org> |
| Subject: | Re: pg_rewind - restore new slave failed to startup during recovery |
| Date: | 2017-08-22 04:11:49 |
| Message-ID: | 42fbdfcee16049f184aebbdd5441d876@ITUPW-EXMBOX2B.UniNet.unisa.edu.au |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-general |
Thanks Michael.
> First things first. What is the version of PostgreSQL involved here?
The PostgreSQL is version 9.6.
>4 is the number of the last timeline the promoted standby has been using, right?
The history file in pg_xlog, is dated at the time of promotion on the standby (current master)
-rw-------. 1 postgres postgres 131 Aug 21 13:26 00000004.history
$ more 00000004.history
1 20/5C000098 no recovery target specified
2 76/F8000098 no recovery target specified
3 9C/7CC50680 no recovery target specified
> What are the contents of /pg_backup/backup/archive_sync/?
The archive folder is /pg_backup/backup/archive, I ftp'ed all the contents from the /pg_backup/backup/archive folder from the new master to the /pg_backup/backup/archive_sync on the new slave.
-----Original Message-----
From: Michael Paquier [mailto:michael(dot)paquier(at)gmail(dot)com]
Sent: Tuesday, 22 August 2017 10:37 AM
To: Dylan Luong <Dylan(dot)Luong(at)unisa(dot)edu(dot)au>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: [GENERAL] pg_rewind - restore new slave failed to startup during recovery
On Tue, Aug 22, 2017 at 9:52 AM, Dylan Luong <Dylan(dot)Luong(at)unisa(dot)edu(dot)au> wrote:
> I have 1 master and 1 slave wal streaming replication setup and the
> Application connects via a load balancer (LTM) where the all
> connections are redirected to the master member (master db).
>
> We have archive_mode enabled.
First things first. What is the version of PostgreSQL involved here?
> I am trying to test to use pg_rewind to restore the new slave (old
> master) after a failover while the system is under load.
Don't worry. pg_rewind works :)
> Here are the steps I take to test:
>
> 1. Disable the master ltm member (all connections redired to slave
> member)
> 2. Promote slave (touch promote.me)
> 3. Stop the master db (old master)
> 4. Do pg_rewind on the new slave (old master)
> 5. Start the new slave.
That flow looks correct to me. No I think that you should trigger manually a checkpoint after step 2 on the promoted standby so as its control file gets forcibly updated correctly with its new timeline number. This is a small but critical point people usually miss. The documentation of pg_rewind does not mention this point when using a live source server, and many people have fallen into this trap up to now... We should really mention that in the docs. What do others think?
> Checking the on the new master, I see that the check point that its
> trying to restore is the file 000000040000009C0000006F, but the file
> does not exist anywhere on the new master. Not in the pg_xlog or the
> archive folder. (as specified in the postgresql.conf)
4 is the number of the last timeline the promoted standby has been using, right?
> Please see attached psql.jpg.
>
> Here is my recovery.conf :
> standby_mode = 'on'
> primary_conninfo = 'host=10.69.19.18 user=replicant’
> trigger_file = '/var/run/promote_me'
> restore_command = 'cp /pg_backup/backup/archive_sync/%f "%p"'
>
> does anyone know why?
What are the contents of /pg_backup/backup/archive_sync/? Are you sure that the promoted standby has archived correctly the first segment of its new timeline for example?
> Under what conditions will pg_rewind wont’ work?
Only one WAL segment missing would prevent any base backup or rewound node to reach a consistent point. You need to be careful about the contents of your archives. Now a failover done correctly is a tricky thing, which could likely fail if you don't issue a checkpoint immediately on the promoted standby if pg_rewind is kicked in the process before an automatic checkpoint happens (because of timeout or volume, whichever).
--
Michael
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Abhijit Gharami | 2017-08-22 04:58:41 | Fwd: PostgreSQL 9.4.13 is facing issue in shutting down |
| Previous Message | Igor Korot | 2017-08-22 04:08:44 | What is the proper query |