Re: Broken replication in 3 node setup, master halt, standby 1 promoted, standby 2 - replication not working

From: Shreeyansh Dba <shreeyansh2014(at)gmail(dot)com>
To: Pritam Barhate <pritambarhate(at)gmail(dot)com>
Cc: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Re: Broken replication in 3 node setup, master halt, standby 1 promoted, standby 2 - replication not working
Date: 2018-01-14 04:33:10
Message-ID: CAGDYbUPinWq3xpQJLZ3ZRDabFAMgTwBwpy4M-2DReqeRps6Tjg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Hi Pritam Barhate,

Please check after promoting pg2 as master pg3 is up and running.
Provide Ip address of pg2 in recovery.conf of pg3 instead hostname and
pass *recovery_target_timeline='latest'* ​
Also check archive_command in postgresql.conf of pg2 and restore_command in
recovery.conf of pg3 is identical. and permission for pgbackrest.
​​

Hope this helps..

Best Regards,
Shreeyansh

On Thu, Jan 11, 2018 at 12:36 AM, Pritam Barhate <pritambarhate(at)gmail(dot)com>
wrote:

> Hi,
>
> I am trying to test the failover scenario on a brand new 3 node setup.
>
> pg1 is master
> pg2 and pg3 are standby hosts in streaming replication, both following
> pg1.
>
> When I halt pg1, promote pg2 and update recovery.conf on pg3 to point to
> pg2, the replication is broken. Can you please help me identify the correct
> procedure to achieve this failover?
>
> Here are the details:
>
> Software Versions:
>
> PostgreSQL: 9.6
> Ubuntu 16.04
> PgBackRest 1.25
>
> Here is the scenario:
>
> Set up 3 nodes in streaming replication:
>
> master (pg1), standby 1 (pg2), standby 2 (pg3)
>
> Also, I have set up pgbackrest, to back up from standby. It's working
> correctly.
>
> Run the following on pg1:
>
> sudo -u postgres psql -x -c "select * from pg_stat_replication;"
>
> -[ RECORD 1 ]----+------------------------------
> pid | 20714
> usesysid | 16384
> usename | replicator
> application_name | walreceiver
> client_addr | 10.0.15.11
> client_hostname |
> client_port | 48640
> backend_start | 2018-01-10 18:17:46.105694+00
> backend_xmin |
> state | streaming
> sent_location | 0/6000060
> write_location | 0/6000060
> flush_location | 0/6000060
> replay_location | 0/6000060
> sync_priority | 0
> sync_state | async
> -[ RECORD 2 ]----+------------------------------
> pid | 20715
> usesysid | 16384
> usename | replicator
> application_name | walreceiver
> client_addr | 10.0.15.12
> client_hostname |
> client_port | 42198
> backend_start | 2018-01-10 18:17:48.712575+00
> backend_xmin |
> state | streaming
> sent_location | 0/6000060
> write_location | 0/6000060
> flush_location | 0/6000060
> replay_location | 0/6000060
> sync_priority | 0
> sync_state | async
>
>
> It reports both the standbys working correctly.
>
> Halt pg1.
>
> On pg2:
>
> sudo -u postgres pg_ctlcluster 9.6 main promote
>
> On pg3:
>
> sudo service postgresql stop
>
> sudo -u postgres vim /var/lib/postgresql/9.6/main/recovery.conf
>
> modify it's contents are as follows:
>
> ```
> primary_conninfo = 'host=pg2 port=5432 user=replicator'
> standby_mode = 'on'
> restore_command = '/usr/bin/pgbackrest --stanza=main archive-get %f "%p"'
> ```
>
> sudo service postgresql start
>
> On pg2:
>
> sudo -u postgres psql -x -c "select * from pg_stat_replication;"
>
> -[ RECORD 1 ]----+------------------------------
> pid | 21238
> usesysid | 16384
> usename | replicator
> application_name | walreceiver
> client_addr | 10.0.15.12
> client_hostname |
> client_port | 44852
> backend_start | 2018-01-10 18:22:20.980701+00
> backend_xmin |
> state | startup
> sent_location |
> write_location |
> flush_location |
> replay_location |
> sync_priority | 0
> sync_state | async
>
> As you can see sent_location, write_location, flush_location all are
> empty.
>
> On pg3:
>
> tail -f /var/log/postgresql/postgresql-9.6-main.log
> 2018-01-10 18:22:19.490 UTC [21229] LOG: restored log file
> "000000010000000000000005" from archive
> 2018-01-10 18:22:19.497 UTC [21229] LOG: redo starts at 0/5000028
> 2018-01-10 18:22:19.757 UTC [21243] postgres(at)postgres FATAL: the
> database system is starting up
> 2018-01-10 18:22:20.188 UTC [21229] LOG: restored log file
> "000000010000000000000006" from archive
> 2018-01-10 18:22:20.195 UTC [21229] LOG: consistent recovery state
> reached at 0/7000000
> 2018-01-10 18:22:20.195 UTC [21228] LOG: database system is ready to
> accept read only connections
> 2018-01-10 18:22:20.812 UTC [21229] LOG: invalid magic number 0000 in log
> segment 000000010000000000000007, offset 0
> 2018-01-10 18:22:20.824 UTC [21255] LOG: fetching timeline history file
> for timeline 2 from primary server
> 2018-01-10 18:22:20.828 UTC [21255] LOG: primary server contains no more
> WAL on requested timeline 1
> 2018-01-10 18:22:21.417 UTC [21255] LOG: primary server contains no more
> WAL on requested timeline 1
>
>
> At this point is I insert data in pg2, it doesn't reflect in pg3.
>
> Please help.
>
> Regards,
>
> Pritam.
>
>

--
[image: http://www.shreeyansh.com] <http://www.shreeyansh.com>

In response to

Browse pgsql-admin by date

  From Date Subject
Next Message nagaraj L M 2018-01-14 16:32:55 Re: Please remove me from the list
Previous Message Shivakumar Ramannavar 2018-01-12 21:41:39 Re: Please remove me from the mailing list....