Re: postgresql 9.3 failover time

From: David Steele <david(at)pgmasters(dot)net>
To: Shay Cohavi <cohavisi(at)gmail(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: postgresql 9.3 failover time
Date: 2015-12-14 20:40:26
Message-ID: 566F293A.5030704@pgmasters.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 12/12/15 2:08 PM, Shay Cohavi wrote:
> *I have postgresql 9.3 setup with 2 nodes (active/standby with streaming
> replication & continuos archiving).*
> *I have created 2 failover & failback script in order to perform a
> switchover between the DB servers:*
> *1. failover - create a trigger file in order to promote the new primary.*
> *2. failback - perform a base backup as mentions in :*
> * a. start backup on the primary.*
> * b. stop the failed node .*
> * *didn't delete the DB directory on the failed node
> * c. performing rsync between the nodes.*

If you use rsync here be sure to use checksums. The clusters are very
similar to each other and rsync timestamp resolution could become a problem.

> * d.stopping the backup on the primary.*
> * e.performing rsync on the pg_xlog.*
> * f. creating a recovery.conf*
> /
> / standby_mode = 'on'/
> / primary_conninfo = 'host=10.50.1.153 port=5432 user=usr password=pass'/
> / restore_command = 'scp 10.50.1.153:/home/postgres/archive/%f %p'/
> / trigger_file = '/home/postgres/databases/fabrix/trigger'/
> / archive_cleanup_command = 'ssh 10.50.1.153
> /home/postgres/pg_utils/archive_cleanup.sh %r'/
> *
> * g. starting the failed node as secondary.*
> *
> *the switchover method:*
> *1. stop the primary node.*
> *2. promote the secondary node (failover.sh).*
> *3. perform failback on the failed node.*
> *4. start the failed node.*
> *
> *this method works great! *
> *
> *but if I perform multiple switchovers (>20), each time the new primary
> gets promoted (trigger file) - it takes longer because it searches the
> timelines on the archive. *

This is an indication that your backup/restore process is not working
correctly. Postgres should only look for timelines that are greater
than the current timeline.

> *for example:*
>
> /[2015-12-12 20:35:10.769 IST] LOG: trigger file found:
> /home/postgres/databases/fabrix/trigger/
> /[2015-12-12 20:35:10.769 IST] FATAL: terminating walreceiver process
> due to administrator command/
> /scp: /home/postgres/archive/0000009400000002000000DC: No such file or
> directory/
> /[2015-12-12 20:35:10.893 IST] LOG: record with zero length at 2/DC000168/
> /[2015-12-12 20:35:10.893 IST] LOG: redo done at 2/DC000100/
> /scp: /home/postgres/archive/0000009400000002000000DC: No such file or
> directory/
> /scp: /home/postgres/archive/0000009300000002000000DC: No such file or
> directory/
> /scp: /home/postgres/archive/0000009200000002000000DC: No such file or
> directory/
> /./
> /./
> /./
> /
> /
> /scp: /home/postgres/archive/0000009100000002000000DC: No such file or
> directory/
> /scp: /home/postgres/archive/0000009000000002000000DC: No such file or
> directory/
> /scp: /home/postgres/archive/00000095.history: No such file or directory/
> /[2015-12-12 20:35:11.801 IST] LOG: selected new timeline ID: 149/
> /[2015-12-12 20:35:11.931 IST] LOG: restored log file
> "00000094.history" from archive/
> /[2015-12-12 20:35:12.173 IST] LOG: archive recovery complete/
> /[2015-12-12 20:35:12.181 IST] LOG: database system is ready to accept
> connections/
> /[2015-12-12 20:35:12.181 IST] LOG: autovacuum launcher started/

It's not clear to me how you got to timeline 149. Some lines have been
removed - did the history log requests go all the way to 148?

Rsync is possibly your issue here - maybe pg_control is not being copied
because the timestamp is the same on both systems (rsync only has a 1
second time resolution so this is very possible between a master and a
streaming replica). Try rsync with checksums (--checksum) and see if
that makes a difference.

--
-David
david(at)pgmasters(dot)net

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Adrian Klaver 2015-12-14 21:02:00 Re: Permissions, "soft read failure" - wishful thinking?
Previous Message David G. Johnston 2015-12-14 19:35:55 Re: Permissions, "soft read failure" - wishful thinking?