Re: Hot-Standby resync problem after connection loss

From: Brian Weaver <cmdrclueless(at)gmail(dot)com>
To: Stefan Kohlhauser <stefan(dot)kohlhauser(at)gmx(dot)net>
Cc: pgsql-admin(at)postgresql(dot)org
Subject: Re: Hot-Standby resync problem after connection loss
Date: 2014-01-10 17:47:42
Message-ID: CAAhXZGv4bFzgUNGTWkvcRh7HopAfnoB_RKjaA_40Wm4U3C1NRQ@mail.gmail.com
Lists: pgsql-admin

Sounds like it might be related to a known data corruption issue found in
9.2.5. See the following thread from the list archive:

http://www.postgresql.org/message-id/0E76EE0A-1740-41B5-88FF-54AA98794532@thebuild.com
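
If that is the culprit, the first step before any resync attempt is to confirm which minor release is actually running; the fix for the bug discussed in that thread shipped in a later 9.2.x minor (9.2.6, if memory serves; worth checking the release notes). A quick sketch; `version_lt` is a small helper written here, not a PostgreSQL tool:

```shell
# Compare the running server version against the first fixed minor release.
version_lt() {
    # true (exit 0) if $1 sorts strictly before $2 (uses GNU sort -V)
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" = "$1" ]
}

server_version=$(psql -Atc "SHOW server_version;")
if version_lt "$server_version" "9.2.6"; then
    echo "Running $server_version -- apply the 9.2.x minor upgrade before resyncing."
fi
```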

On Thu, Jan 9, 2014 at 5:00 AM, Stefan Kohlhauser <stefan(dot)kohlhauser(at)gmx(dot)net> wrote:

> Hello everyone!
>
> I have set up a PostgreSQL 9.2.5 in hot-standby with a master and one
> standby on two virtualized RHEL6.4. I am using them for a high-availability
> Kamailio SIP-server. Everything is managed by Pacemaker/Corosync with two
> network connections, one for Pacemaker, one for PostgreSQL.
> After setup replication works fine.
>
> postgresql.conf excerpt (same on both nodes):
> wal_level = hot_standby
> synchronous_commit = off
> wal_buffers = 1MB
> archive_mode = on
> archive_command = 'test ! -f /var/lib/kamailiodb1/archive/%f && cp %p
> /var/lib/kamailiodb1/archive/%f'
> max_wal_senders = 16
> replication_timeout = 10s
> hot_standby = on
> max_standby_archive_delay = -1
> max_standby_streaming_delay = -1
> wal_receiver_status_interval = 3s
> hot_standby_feedback = on
>
> However, I have a test case that troubles me:
> Node 1 runs the master, node 2 the standby. I remove the network
> connection for PostgreSQL (the Pacemaker instances still see each other).
> After a few seconds I power off node 1. This causes node 2 to be promoted to
> master by Pacemaker. I add a new insert on node 2. Then I boot node 1. From
> the log entries on node 1 they seem to resync:
> 2014-01-08T16:30:43.698+00:00 kamailionode1 postgres-kamailio[1980] info:
> [1-1] LOG: database system was shut down in recovery at 2014-01-08 16:29:14
> GMT
> 2014-01-08T16:30:43.843+00:00 kamailionode1 postgres-kamailio[1980] info:
> [2-1] LOG: entering standby mode
> 2014-01-08T16:30:44.024+00:00 kamailionode1 postgres-kamailio[1980] info:
> [3-1] LOG: consistent recovery state reached at 0/6000080
> 2014-01-08T16:30:44.025+00:00 kamailionode1 postgres-kamailio[1965] info:
> [1-1] LOG: database system is ready to accept read only connections
> 2014-01-08T16:30:44.025+00:00 kamailionode1 postgres-kamailio[1980] info:
> [4-1] LOG: record with zero length at 0/6000080
> 2014-01-08T16:30:44.272+00:00 kamailionode1 postgres-kamailio[1998] info:
> [2-1] LOG: streaming replication successfully connected to primary
>
> However, the new insert is not shown on node 1 when I query the DB with
> psql, on node 2 it is shown.
>
> recovery.conf:
> standby_mode = 'on'
> primary_conninfo = 'host=pgreplicationha port=5432 user=replicate
> application_name=kamailionode2 '
> restore_command = 'scp -o ConnectTimeout=10 -o
> UserKnownHostsFile=/dev/null -o StrictHostkeyChecking=no
> pgreplicationha:/var/lib/kamailiodb1/archive/%f %p'
> recovery_target_timeline = 'latest'
>
> I'm not sure I understand this and if it has anything to do with it, but
> before the connection loss node 1 was using WAL 000000010000000000000005.
> Now "ps aux" on node 1 tells me:
> postgres: startup process waiting for 000000010000000000000006
> Doesn't that mean it is waiting for information from
> 000000010000000000000006 which isn't even used by the master yet, according
> to pg_controldata?
>
> pg_controldata excerpt on node 1 (now standby):
> Database cluster state: in archive recovery
> Latest checkpoint location: 0/6000020
> Prior checkpoint location: 0/6000020
> Latest checkpoint's REDO location: 0/6000020
> Minimum recovery ending location: 0/6000020
>
> pg_controldata excerpt on node 2 (now master):
> Database cluster state: in production
> Latest checkpoint location: 0/519BB68
> Prior checkpoint location: 0/5187F38
> Latest checkpoint's REDO location: 0/519BB68
> Minimum recovery ending location: 0/0
>
> I know this could be a potential split-brain (internally it is one with
> regard to timelines, if I understood correctly; however, the timeline part
> of the WAL file names stays 00000001xxxxx(?)).
> I need a way to fix this automatically by resyncing to what the current
> master holds.
> The DB will contain only a few entries and DB modifications are probably
> rare. Keeping the overall downtime of the system as low as possible is much
> more important than possibly losing one or two entries.
> Therefore my question: How do I resync the new standby node properly
> without restarting/rebooting/moving the master?
>
> Best regards and thanks in advance,
> Stefan
>
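
As for the resync question: once the old master has diverged, replaying WAL alone won't reconcile the two clusters, and on 9.2 there is no pg_rewind (that arrived much later), so the usual approach is to re-seed node 1 with a fresh base backup from the new master. A sketch, run as the postgres user on node 1; the data directory path and the application_name are assumptions based on the thread, not something I can see from here:

```shell
# On node 1 (the stale ex-master). Pacemaker should have the resource
# stopped or unmanaged while this runs.
PGDATA=/var/lib/kamailiodb1/data   # assumed path -- adjust to your layout

# 1. Make sure the local postmaster is down.
pg_ctl -D "$PGDATA" stop -m fast

# 2. Move the stale cluster aside and pull a fresh copy from the new master.
#    The replication user and host come from your primary_conninfo.
mv "$PGDATA" "${PGDATA}.old.$(date +%s)"
pg_basebackup -h pgreplicationha -U replicate -D "$PGDATA" -X stream -P

# 3. Recreate recovery.conf by hand (9.2's pg_basebackup has no
#    --write-recovery-conf; that option appeared in 9.3).
cat > "$PGDATA/recovery.conf" <<'EOF'
standby_mode = 'on'
primary_conninfo = 'host=pgreplicationha port=5432 user=replicate application_name=kamailionode1'
restore_command = 'scp -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o StrictHostkeyChecking=no pgreplicationha:/var/lib/kamailiodb1/archive/%f %p'
recovery_target_timeline = 'latest'
EOF

# 4. Start the standby and confirm it is replaying.
pg_ctl -D "$PGDATA" start
psql -Atc "SELECT pg_is_in_recovery();"   # should print t on the standby
```

Since your DB is small this full copy should be fast, and it never touches the running master beyond serving one base backup. Automating it in the Pacemaker resource agent's start path is how most setups make the "failed node rejoins as standby" case hands-off.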

--

/* insert witty comment here */
