pg cluster not cleaning up after failover

From: Peter Brunnengräber <pbrunnen(at)bccglobal(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: pg cluster not cleaning up after failover
Date: 2016-07-13 15:58:50
Message-ID: 1941931359.108.1468425520834.JavaMail.pbrunnen@Station8.local
Lists: pgsql-admin

Hello all,
I'm having an issue with a PostgreSQL 9.2 cluster during failover and hope you all can help. I have been attempting to follow the guide provided at ClusterLabs(1), but am not having much luck, and I don't quite understand where the issue is. I'm running on Debian Wheezy.

My crm_mon output is below. One server is PRI and operating normally after taking over. I have PostgreSQL set up to do WAL archiving via rsync to the opposite node: <archive_command = 'rsync -a %p test-node2:/db/data/postgresql/9.2/pg_archive/%f'>. The rsync is working, and I do see WAL files arriving on the other host appropriately.
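
For completeness, the relevant replication settings on the primary look roughly like this (a sketch of my postgresql.conf; only the archive_command line is exact, the rest are the usual 9.2 streaming-replication settings):

# postgresql.conf on the current primary (sketch)
wal_level = hot_standby
archive_mode = on
archive_command = 'rsync -a %p test-node2:/db/data/postgresql/9.2/pg_archive/%f'
max_wal_senders = 5
hot_standby = on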

Node2 was the PRI. Last night node1, which was previously in HS:sync, was promoted to PRI, and node2 was stopped. The WAL files from node1 are arriving on node2. I cleaned up the /tmp/PGSQL.lock file and proceeded with a pg_basebackup restore from node1, roughly as sketched below. This all went through without error in the node1 PostgreSQL log.
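
For reference, the resync steps I ran on node2 were roughly these, following the ClusterLabs guide (a sketch; $PGDATA here stands in for my actual data directory):

# on node2, with the pgsql resource stopped there
rm -f /tmp/PGSQL.lock
mv $PGDATA $PGDATA.bak
pg_basebackup -h test-node1 -D $PGDATA -X stream -P
# then clear the resource failure state so pacemaker restarts it
crm resource cleanup msPostgresql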

After running a crm cleanup on the msPostgresql resource, node2 keeps showing 'LATEST' but gets hung up at HS:alone. I also don't understand why node2's pgsql-xlog-loc shows 0000001EB9053DD8, which is ahead of node1's pgsql-master-baseline of 0000001EB2000080. I saw the 'cannot stat ... 000000010000001E000000BB' error, but that always seems to happen for the current xlog file.
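
In case it helps, this is roughly how I compared the positions on the two nodes (a sketch, using the 9.2 function names):

# on node1 (current primary)
psql -c "select pg_current_xlog_location();"
# on node2 (standby)
psql -c "select pg_last_xlog_receive_location(), pg_last_xlog_replay_location();"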

And as if I weren't confused enough, the PostgreSQL log on node2 says "streaming replication successfully connected to primary", and the pg_stat_replication query on node1 shows it connected, but ASYNC.
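
My understanding is that with rep_mode=sync the pgsql RA manages synchronous_standby_names on the master itself and only reports HS:sync once it has switched the standby to synchronous, so I also checked the master directly (a sketch; the second query just adds the location columns to the one shown below):

psql -h test-node1 -c "show synchronous_standby_names;"
psql -h test-node1 -c "select application_name, state, sync_state, sent_location, replay_location from pg_stat_replication;"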

Any ideas?

Very much appreciated!
-With kind regards,
Peter Brunnengräber

References:
(1) http://clusterlabs.org/wiki/PgSQL_Replicated_Cluster#after_fail-over

###
============
Last updated: Wed Jul 13 14:51:53 2016
Last change: Wed Jul 13 14:49:17 2016 via crmd on test-node2
Stack: openais
Current DC: test-node1 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
4 Resources configured.
============

Online: [ test-node1 test-node2 ]

Full list of resources:

Resource Group: g_master
ClusterIP-Net1 (ocf::heartbeat:IPaddr2): Started test-node1
ReplicationIP-Net2 (ocf::heartbeat:IPaddr2): Started test-node1
Master/Slave Set: msPostgresql [pgsql]
Masters: [ test-node1 ]
Slaves: [ test-node2 ]

Node Attributes:
* Node test-node1:
+ master-pgsql:0 : 1000
+ master-pgsql:1 : 1000
+ pgsql-data-status : LATEST
+ pgsql-master-baseline : 0000001EB2000080
+ pgsql-status : PRI
* Node test-node2:
+ master-pgsql:0 : -INFINITY
+ master-pgsql:1 : -INFINITY
+ pgsql-data-status : LATEST
+ pgsql-status : HS:alone
+ pgsql-xlog-loc : 0000001EB9053DD8

Migration summary:
* Node test-node2:
* Node test-node1:

#### Node2
2016-07-13 14:55:09 UTC LOG: database system was interrupted; last known up at 2016-07-13 14:54:27 UTC
2016-07-13 14:55:09 UTC LOG: creating missing WAL directory "pg_xlog/archive_status"
cp: cannot stat `/db/data/postgresql/9.2/pg_archive/00000002.history': No such file or directory
2016-07-13 14:55:09 UTC LOG: entering standby mode
2016-07-13 14:55:09 UTC LOG: restored log file "000000010000001E000000BA" from archive
2016-07-13 14:55:09 UTC FATAL: the database system is starting up
2016-07-13 14:55:09 UTC LOG: redo starts at 1E/BA000020
2016-07-13 14:55:09 UTC LOG: consistent recovery state reached at 1E/BA05FED8
2016-07-13 14:55:09 UTC LOG: database system is ready to accept read only connections
cp: cannot stat `/db/data/postgresql/9.2/pg_archive/000000010000001E000000BB': No such file or directory
cp: cannot stat `/db/data/postgresql/9.2/pg_archive/00000002.history': No such file or directory
2016-07-13 14:55:09 UTC LOG: streaming replication successfully connected to primary

#### Node1
postgres=# select application_name,upper(state),upper(sync_state) from pg_stat_replication;
 application_name |   upper   | upper
------------------+-----------+-------
 test-node2       | STREAMING | ASYNC
(1 row)
