Re: Failing replication

From: Scott Whitney <scott(at)journyx(dot)com>
To: Scott Whitney <swhitney(at)journyx(dot)com>
Cc: postgres admin <pgsql-admin(at)postgresql(dot)org>
Subject: Re: Failing replication
Date: 2014-06-03 00:35:43
Message-ID: 196857337.1802139.1401755743114.JavaMail.zimbra@mail.int.journyx.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

rsync has completed. Cascading (#3 -> #1 via #2) is up and running again.

Log as of now:

LOG: consistent recovery state reached at 31C/1C979FD8
LOG: database system is ready to accept read only connections
LOG: record with zero length at 31C/1CCDF680
LOG: streaming replication successfully connected to primary

=== me....it's working 7:29 pm
=== scott again...restarting pg -> db01 at 6/2/14 7:33 just to see what happens.

LOG: database system was shut down in recovery at 2014-06-02 19:34:07 CDT

LOG: entering standby mode
LOG: redo starts at 31C/1F443578
LOG: consistent recovery state reached at 31C/20278980
LOG: record with incorrect prev-link 319/60278768 at 31C/20278980
LOG: database system is ready to accept read only connections
LOG: streaming replication successfully connected to primary

----- Original Message -----

> So, I have 3 servers:

> 1) Production.
> 2) "Production" training server. Replicates #1. At same data center. 2
> postmasters on separate ports.
> 3) Internal "production" server. 3 postmasters on separate ports. Replicates
> #2 and #1 (both from #2).

> This has happened to me several times in the past few weeks.

> From #3:

> LOG: entering standby mode
> LOG: redo starts at 31B/FDFFC0D8
> WARNING: page 7843 of relation base/25999670/33421425 is uninitialized
> PANIC: WAL contains references to invalid pages
> LOG: startup process (PID 4580) was terminated by signal 6: Aborted
> LOG: terminating any other active server processes

> ===

> here is me restarting...adding this since there are no timestamps in this
> log...

> ===

> LOG: database system was interrupted while in recovery at log time 2014-06-02
> 17:31:05 CDT
> HINT: If this has occurred more than once some data might be corrupted and
> you might need to choose an earlier recovery target.
> LOG: entering standby mode
> LOG: redo starts at 31B/FDFFC0D8
> WARNING: page 7843 of relation base/25999670/33421425 is uninitialized
> PANIC: WAL contains references to invalid pages
> LOG: startup process (PID 4838) was terminated by signal 6: Aborted
> LOG: terminating any other active server processes

> There is no activity in #2's pg_log postgresql-Mon.log file consummate to
> this event.

> I have wal segments keeping 1,000. I have pg_xlog files dated yesterday.

> This happened earlier today as well. It happened twice last week.

> The "warning page" stuff is different from the pg_controldata stuff earlier
> today.

> Here is pg_controldata from server 3 wrt server 1 replication.

> pg_control version number: 922
> Catalog version number: 201204301
> Database system identifier: 5868266298097360799
> Database cluster state: in archive recovery
> pg_control last modified: Mon Jun 2 18:42:18 2014
> Latest checkpoint location: 31B/FE08DB50
> Prior checkpoint location: 31B/FE08DB50
> Latest checkpoint's REDO location: 31B/FDFFC0D8
> Latest checkpoint's TimeLineID: 1
> Latest checkpoint's full_page_writes: on
> Latest checkpoint's NextXID: 0/441709396
> Latest checkpoint's NextOID: 33432042
> Latest checkpoint's NextMultiXactId: 28077
> Latest checkpoint's NextMultiOffset: 57507
> Latest checkpoint's oldestXID: 241800232
> Latest checkpoint's oldestXID's DB: 16573
> Latest checkpoint's oldestActiveXID: 441709396
> Time of latest checkpoint: Mon Jun 2 17:31:05 2014
> Minimum recovery ending location: 31C/4E1C18
> Backup start location: 0/0
> Backup end location: 0/0
> End-of-backup record required: no
> Current wal_level setting: hot_standby
> Current max_connections setting: 4000
> Current max_prepared_xacts setting: 0
> Current max_locks_per_xact setting: 64
> Maximum data alignment: 8
> Database block size: 8192
> Blocks per segment of large relation: 131072
> WAL block size: 8192
> Bytes per WAL segment: 16777216
> Maximum length of identifiers: 64
> Maximum columns in an index: 32
> Maximum size of a TOAST chunk: 1996
> Date/time type storage: 64-bit integers
> Float4 argument passing: by value
> Float8 argument passing: by value

> I have to redo the base backup and bring it back up again, so the rest of the
> stuff is about to be rsycnhed awak, but it has now failed twice today. It
> might well, again.

> What more should I provide to try to find a solution?

> Server 2 -> 1 replication has no issues.

> Server 3 -> 2 (replicating server 1) is the one that's failing.

> 65Mbps line between them.

> PG 9.28 across the board on Centos 6.5 64bit all across.

In response to

Browse pgsql-admin by date

  From Date Subject
Next Message Gabriele Lohss 2014-06-04 10:29:57 Re: Oracle Postgres Conversion
Previous Message Scott Whitney 2014-06-02 23:49:11 Failing replication