Re: pg_rewind - restore new slave failed to startup during recovery

From: Dylan Luong <Dylan(dot)Luong(at)unisa(dot)edu(dot)au>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: Re: pg_rewind - restore new slave failed to startup during recovery
Date: 2017-08-22 04:11:49
Message-ID: 42fbdfcee16049f184aebbdd5441d876@ITUPW-EXMBOX2B.UniNet.unisa.edu.au
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Thanks Michael.

> First things first. What is the version of PostgreSQL involved here?

The PostgreSQL is version 9.6.

>4 is the number of the last timeline the promoted standby has been using, right?

The history file in pg_xlog, is dated at the time of promotion on the standby (current master)
-rw-------. 1 postgres postgres 131 Aug 21 13:26 00000004.history

$ more 00000004.history
1 20/5C000098 no recovery target specified

2 76/F8000098 no recovery target specified

3 9C/7CC50680 no recovery target specified

> What are the contents of /pg_backup/backup/archive_sync/?

The archive folder is /pg_backup/backup/archive, I ftp'ed all the contents from the /pg_backup/backup/archive folder from the new master to the /pg_backup/backup/archive_sync on the new slave.

-----Original Message-----
From: Michael Paquier [mailto:michael(dot)paquier(at)gmail(dot)com]
Sent: Tuesday, 22 August 2017 10:37 AM
To: Dylan Luong <Dylan(dot)Luong(at)unisa(dot)edu(dot)au>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: [GENERAL] pg_rewind - restore new slave failed to startup during recovery

On Tue, Aug 22, 2017 at 9:52 AM, Dylan Luong <Dylan(dot)Luong(at)unisa(dot)edu(dot)au> wrote:
> I have 1 master and 1 slave wal streaming replication setup and the
> Application connects via a load balancer (LTM) where the all
> connections are redirected to the master member (master db).
>
> We have archive_mode enabled.

First things first. What is the version of PostgreSQL involved here?

> I am trying to test to use pg_rewind to restore the new slave (old
> master) after a failover while the system is under load.

Don't worry. pg_rewind works :)

> Here are the steps I take to test:
>
> 1. Disable the master ltm member (all connections redired to slave
> member)
> 2. Promote slave (touch promote.me)
> 3. Stop the master db (old master)
> 4. Do pg_rewind on the new slave (old master)
> 5. Start the new slave.

That flow looks correct to me. No I think that you should trigger manually a checkpoint after step 2 on the promoted standby so as its control file gets forcibly updated correctly with its new timeline number. This is a small but critical point people usually miss. The documentation of pg_rewind does not mention this point when using a live source server, and many people have fallen into this trap up to now... We should really mention that in the docs. What do others think?

> Checking the on the new master, I see that the check point that its
> trying to restore is the file 000000040000009C0000006F, but the file
> does not exist anywhere on the new master. Not in the pg_xlog or the
> archive folder. (as specified in the postgresql.conf)

4 is the number of the last timeline the promoted standby has been using, right?

> Please see attached psql.jpg.
>
> Here is my recovery.conf :
> standby_mode = 'on'
> primary_conninfo = 'host=10.69.19.18 user=replicant’
> trigger_file = '/var/run/promote_me'
> restore_command = 'cp /pg_backup/backup/archive_sync/%f "%p"'
>
> does anyone know why?

What are the contents of /pg_backup/backup/archive_sync/? Are you sure that the promoted standby has archived correctly the first segment of its new timeline for example?

> Under what conditions will pg_rewind wont’ work?

Only one WAL segment missing would prevent any base backup or rewound node to reach a consistent point. You need to be careful about the contents of your archives. Now a failover done correctly is a tricky thing, which could likely fail if you don't issue a checkpoint immediately on the promoted standby if pg_rewind is kicked in the process before an automatic checkpoint happens (because of timeout or volume, whichever).
--
Michael

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Abhijit Gharami 2017-08-22 04:58:41 Fwd: PostgreSQL 9.4.13 is facing issue in shutting down
Previous Message Igor Korot 2017-08-22 04:08:44 What is the proper query