Timeline switch problem with streaming replication with 3 nodes

From: Mads(dot)Tandrup(at)schneider-electric(dot)com
To: pgsql-general(at)postgresql(dot)org
Subject: Timeline switch problem with streaming replication with 3 nodes
Date: 2012-09-24 12:37:33
Message-ID: OF80BBB332.B495F5C6-ONC1257A83.00430216-C1257A83.00455AFC@apcc.com
Lists: pgsql-general


Hi All

I've set up three PostgreSQL nodes: one master and two slaves. They are
configured for streaming replication with synchronous replication turned on.
I've also set up a virtual IP that points to the current master node.

When I kill the master node, the slave that was synchronous gets promoted
to master and takes over the shared virtual IP.

But sometimes the other slave doesn't accept the switch, and its log
instead says:

2012-09-24 10:45:06 GMT 4663 FATAL: replication terminated by primary server
2012-09-24 10:45:06 GMT 4662 LOG: record with zero length at 0/200009E8
2012-09-24 10:45:06 GMT 10209 FATAL: could not connect to the primary server: could not connect to server: Connection refused
        Is the server running on host "10.216.73.60" and accepting
        TCP/IP connections on port 5432?

2012-09-24 10:45:11 GMT 10272 FATAL: could not connect to the primary server: FATAL: recovery is still in progress, can't accept WAL streaming connections

2012-09-24 10:45:16 GMT 10326 FATAL: timeline 10 of the primary does not match recovery target timeline 9
2012-09-24 10:45:21 GMT 10388 FATAL: timeline 10 of the primary does not match recovery target timeline 9
2012-09-24 10:45:26 GMT 10451 FATAL: timeline 10 of the primary does not match recovery target timeline 9
...

And it continues to repeat the last line.
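
To make the repeating failure easier to spot across nodes, the timeline numbers can be pulled out of the log with a small script. This is only a sketch: the function name find_timeline_mismatches is a hypothetical helper, not part of PostgreSQL, and the pattern assumes the message wording shown in the log above.

```python
import re

# Pattern for the FATAL message quoted in the log above. \s+ lets the match
# span a line break, because the mail client wrapped the message.
TIMELINE_MISMATCH = re.compile(
    r"timeline (\d+) of the primary does not\s+"
    r"match recovery target timeline (\d+)"
)


def find_timeline_mismatches(log_text):
    """Return (primary_timeline, target_timeline) pairs found in log_text."""
    return [(int(p), int(t)) for p, t in TIMELINE_MISMATCH.findall(log_text)]


# Sample taken from the standby log in this message.
sample = (
    "2012-09-24 10:45:16 GMT 10326 FATAL: timeline 10 of the primary does not\n"
    "match recovery target timeline 9\n"
)
print(find_timeline_mismatches(sample))
```

Each pair gives the timeline the new master is on, followed by the timeline the standby is still trying to follow.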

The new master says:
2012-09-24 10:45:06 GMT 8394 FATAL: replication terminated by primary server
2012-09-24 10:45:06 GMT 8393 LOG: record with zero length at 0/200009E8
2012-09-24 10:45:11 GMT 8393 LOG: trigger file found: /tmp/postgresql_trigger
2012-09-24 10:45:11 GMT 8393 LOG: redo done at 0/20000990
2012-09-24 10:45:11 GMT 8393 LOG: last completed transaction was at log time 2012-09-24 10:45:01.917175+00
2012-09-24 10:45:11 GMT 8393 LOG: selected new timeline ID: 10
2012-09-24 10:45:11 GMT 10741 [unknown] FATAL: recovery is still in progress, can't accept WAL streaming connections
2012-09-24 10:45:12 GMT 8393 LOG: archive recovery complete
2012-09-24 10:45:12 GMT 8391 LOG: database system is ready to accept connections
2012-09-24 10:45:12 GMT 10743 LOG: autovacuum launcher started

The recovery.conf is:
standby_mode = 'on'
primary_conninfo = 'host=10.216.73.60 port=5432 user=root password=onyx application_name=10.216.73.195'
recovery_target_timeline = 'latest'
trigger_file = '/tmp/postgresql_trigger'
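
To rule out a configuration difference between the two slaves, the file can be checked mechanically on each node. The following is a minimal sketch, assuming every setting is written as name = 'value' on a single line; parse_recovery_conf and check_standby_settings are hypothetical helper names, and the example text is a shortened copy of the file above.

```python
import re

# Matches lines of the form: name = 'value'
# Doubled quotes ('') inside a value are not handled by this sketch.
SETTING = re.compile(r"^\s*([A-Za-z_]+)\s*=\s*'(.*)'\s*$")


def parse_recovery_conf(text):
    """Parse recovery.conf-style text into a dict of setting -> value."""
    settings = {}
    for line in text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # skip comment lines
        match = SETTING.match(line)
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


def check_standby_settings(settings):
    """Return a list of problems with the settings used in this setup."""
    problems = []
    if settings.get("standby_mode") != "on":
        problems.append("standby_mode is not 'on'")
    if settings.get("recovery_target_timeline") != "latest":
        problems.append("recovery_target_timeline is not 'latest'")
    if "primary_conninfo" not in settings:
        problems.append("primary_conninfo is missing")
    return problems


# Shortened copy of the recovery.conf from this message.
example = """\
standby_mode = 'on'
primary_conninfo = 'host=10.216.73.60 port=5432'
recovery_target_timeline = 'latest'
trigger_file = '/tmp/postgresql_trigger'
"""
print(check_standby_settings(parse_recovery_conf(example)))
```

An empty list means the three settings checked here are present with the expected values; it says nothing about whether the slave can actually follow the timeline switch.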

I've found an earlier discussion of a similar issue
(http://archives.postgresql.org/pgsql-general/2011-12/msg00553.php). It
suggests sharing WAL files as the solution, but I thought the whole idea of
streaming replication was that I would not need shared storage.
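
For context, my understanding (an assumption, not something I have verified on these nodes) is that a promoted standby writes a timeline history file named after the new timeline ID, as eight uppercase hexadecimal digits followed by ".history", and that this is the file the other slave would have to obtain from a shared archive. A small sketch of the naming, where timeline_history_filename is a hypothetical helper:

```python
def timeline_history_filename(timeline_id):
    """Return the history file name for a timeline ID (8 hex digits + .history)."""
    return f"{timeline_id:08X}.history"


# Timeline 10 is the one the new master selected in the log above.
print(timeline_history_filename(10))  # 0000000A.history
```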

Is that the only solution or is there another way?

Best regards,
Mads
