Re: Broken replication in 3 node setup, master halt, standby 1 promoted, standby 2 - replication not working

From: Rui DeSousa <rui(dot)desousa(at)icloud(dot)com>
To: Pritam Barhate <pritambarhate(at)gmail(dot)com>
Cc: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Re: Broken replication in 3 node setup, master halt, standby 1 promoted, standby 2 - replication not working
Date: 2018-01-12 19:04:07
Message-ID: 45FEB64D-CE15-40CA-9BF3-DECEAC21B268@icloud.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

You might want to consider sync replication instead to avoid split brain syndrome. It readily depending on how things are promoted and managed.

> On Jan 12, 2018, at 10:50 AM, Pritam Barhate <pritambarhate(at)gmail(dot)com> wrote:
>
> I am using async replication.
>
> As both of you said:
>
> recovery_target_timeline=latest
>
> Was missing. Adding it to the recovery.conf did the trick.
>
> Thanks a lot.
>
> Regards,
>
> Pritam.
>
> On Thu, Jan 11, 2018 at 4:09 AM, Rui DeSousa <rui(dot)desousa(at)icloud(dot)com <mailto:rui(dot)desousa(at)icloud(dot)com>> wrote:
> It appears that your missing is following in the recovery.conf file. It instructs the replica to follow a new timeline when presented; without it will refuse to follow the promoted node.
>
> recovery_target_timeline=latest
>
> Are you using sync replication?
>
>
> > On Jan 10, 2018, at 2:06 PM, Pritam Barhate <pritambarhate(at)gmail(dot)com <mailto:pritambarhate(at)gmail(dot)com>> wrote:
> >
> > Hi,
> >
> > I am trying to test the failover scenario on a brand new 3 node setup.
> >
> > pg1 is master
> > pg2 and pg3 are standby hosts in streaming replication, both following pg1.
> >
> > When I halt pg1, promote pg2 and update recovery.conf on pg3 to point to pg2, the replication is broken. Can you please help me identify the correct procedure to achieve this failover?
> >
> > Here are the details:
> >
> > Software Versions:
> >
> > PostgreSQL: 9.6
> > Ubuntu 16.04
> > PgBackRest 1.25
> >
> > Here is the scenario:
> >
> > Set up 3 nodes in streaming replication:
> >
> > master (pg1), standby 1 (pg2), standby 2 (pg3)
> >
> > Also, I have set up pgbackrest, to back up from standby. It's working correctly.
> >
> > Run the following on pg1:
> >
> > sudo -u postgres psql -x -c "select * from pg_stat_replication;"
> >
> > -[ RECORD 1 ]----+------------------------------
> > pid | 20714
> > usesysid | 16384
> > usename | replicator
> > application_name | walreceiver
> > client_addr | 10.0.15.11
> > client_hostname |
> > client_port | 48640
> > backend_start | 2018-01-10 18:17:46.105694+00
> > backend_xmin |
> > state | streaming
> > sent_location | 0/6000060
> > write_location | 0/6000060
> > flush_location | 0/6000060
> > replay_location | 0/6000060
> > sync_priority | 0
> > sync_state | async
> > -[ RECORD 2 ]----+------------------------------
> > pid | 20715
> > usesysid | 16384
> > usename | replicator
> > application_name | walreceiver
> > client_addr | 10.0.15.12
> > client_hostname |
> > client_port | 42198
> > backend_start | 2018-01-10 18:17:48.712575+00
> > backend_xmin |
> > state | streaming
> > sent_location | 0/6000060
> > write_location | 0/6000060
> > flush_location | 0/6000060
> > replay_location | 0/6000060
> > sync_priority | 0
> > sync_state | async
> >
> >
> > It reports both the standbys working correctly.
> >
> > Halt pg1.
> >
> > On pg2:
> >
> > sudo -u postgres pg_ctlcluster 9.6 main promote
> >
> > On pg3:
> >
> > sudo service postgresql stop
> >
> > sudo -u postgres vim /var/lib/postgresql/9.6/main/recovery.conf
> >
> > modify it's contents are as follows:
> >
> > ```
> > primary_conninfo = 'host=pg2 port=5432 user=replicator'
> > standby_mode = 'on'
> > restore_command = '/usr/bin/pgbackrest --stanza=main archive-get %f "%p"'
> > ```
> >
> > sudo service postgresql start
> >
> > On pg2:
> >
> > sudo -u postgres psql -x -c "select * from pg_stat_replication;"
> >
> > -[ RECORD 1 ]----+------------------------------
> > pid | 21238
> > usesysid | 16384
> > usename | replicator
> > application_name | walreceiver
> > client_addr | 10.0.15.12
> > client_hostname |
> > client_port | 44852
> > backend_start | 2018-01-10 18:22:20.980701+00
> > backend_xmin |
> > state | startup
> > sent_location |
> > write_location |
> > flush_location |
> > replay_location |
> > sync_priority | 0
> > sync_state | async
> >
> > As you can see sent_location, write_location, flush_location all are empty.
> >
> > On pg3:
> >
> > tail -f /var/log/postgresql/postgresql-9.6-main.log
> > 2018-01-10 18:22:19.490 UTC [21229] LOG: restored log file "000000010000000000000005" from archive
> > 2018-01-10 18:22:19.497 UTC [21229] LOG: redo starts at 0/5000028
> > 2018-01-10 18:22:19.757 UTC [21243] postgres(at)postgres FATAL: the database system is starting up
> > 2018-01-10 18:22:20.188 UTC [21229] LOG: restored log file "000000010000000000000006" from archive
> > 2018-01-10 18:22:20.195 UTC [21229] LOG: consistent recovery state reached at 0/7000000
> > 2018-01-10 18:22:20.195 UTC [21228] LOG: database system is ready to accept read only connections
> > 2018-01-10 18:22:20.812 UTC [21229] LOG: invalid magic number 0000 in log segment 000000010000000000000007, offset 0
> > 2018-01-10 18:22:20.824 UTC [21255] LOG: fetching timeline history file for timeline 2 from primary server
> > 2018-01-10 18:22:20.828 UTC [21255] LOG: primary server contains no more WAL on requested timeline 1
> > 2018-01-10 18:22:21.417 UTC [21255] LOG: primary server contains no more WAL on requested timeline 1
> >
> >
> > At this point is I insert data in pg2, it doesn't reflect in pg3.
> >
> > Please help.
> >
> > Regards,
> >
> > Pritam.
> >
>
>

In response to

Browse pgsql-admin by date

  From Date Subject
Next Message Karl Denninger 2018-01-12 19:27:46 Re: Rolling forward to 10.1, new issue...
Previous Message Keith 2018-01-12 18:46:53 Re: Rolling forward to 10.1, new issue...