From: KONDO Mitsumasa <kondo(dot)mitsumasa(at)lab(dot)ntt(dot)co(dot)jp>
To: Lonni J Friedman <netllama(at)gmail(dot)com>
Cc: Phil Monroe <phil(at)identified(dot)com>, pgsql-admin(at)postgresql(dot)org, Infrastructure <infrastructure(at)identified(dot)com>, Vladimir Giverts <vlad(at)identified(dot)com>, Tejas <tejas(at)identified(dot)com>, Sasha Kipervarg <sasha(at)identified(dot)com>
Subject: Re: Postgres WAL Recovery Fails... And Then Works...
Date: 2013-01-15 01:08:16
Message-ID: 50F4AC00.1030805@lab.ntt.co.jp
Lists: pgsql-admin

Hi,

This may be PG 9.2.x on an HA cluster (the pgsql RA on Pacemaker). This is
a known bug: when the HA cluster starts, PostgreSQL performs crash recovery
with a recovery.conf in place, but in PG 9.2 recovery driven by
recovery.conf requires a database cluster that has not crashed. Please see
the following mailing list threads:

[HACKERS] [BUG?] lag of minRecoveryPont in archive recovery
[BUGS] PITR potentially broken in 9.2

This bug is fixed in the latest development code (the next release, 9.2.3).
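
As a quick sanity check (a minimal sketch; the psql connection options are
assumptions, adjust for your environment), you can confirm which server
version a node is running:

    # Versions 9.2.0 - 9.2.2 are affected; 9.2.3 will carry the fix.
    psql -U postgres -At -c 'SHOW server_version;'
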
(2013/01/15 5:02), Lonni J Friedman wrote:
> Your errors look somewhat similar to a problem I reported last week
> (no replies thus far):
> http://www.postgresql.org/message-id/CAP=oouE5niXgAO_34Q+FGq=tS6EsMHXXRRoByov3zn9pU4249w@mail.gmail.com
>
> Except in my case, no number of restarts helped. You didn't say: were
> you explicitly copying $PGDATA, or using some other mechanism to
> migrate the data elsewhere?
>
> Also, which version of postgres are you using?
>
> On Fri, Jan 11, 2013 at 6:32 PM, Phil Monroe <phil(at)identified(dot)com> wrote:
>>
>> Hi Everyone,
>>
>> So we had to fail over and do a full base backup to get our slave database
>> back online, and ran into an interesting scenario. After copying the data
>> directory, setting up the recovery.conf, and starting the slave database,
>> the database crashed while replaying xlogs. However, on the next start, the
>> database replayed xlogs farther than it initially got, but ultimately failed
>> again. After starting the DB a third time, PostgreSQL replayed even further
>> and caught up to the master to start streaming replication. Is this common
>> and/or acceptable?
>>
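>> (For reference, a minimal sketch of the recovery.conf involved; the
>> primary_conninfo values are placeholders, and the restore_command is an
>> assumption based on the unpigz lines in the logs below:)
>>
>>     standby_mode = 'on'
>>     primary_conninfo = 'host=master.example.com port=5432 user=replicator'
>>     restore_command = 'unpigz -c /mnt/db/wals/%f > %p'
>>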
>> The error messages that cause PostgreSQL to stop replaying xlogs:
>>
>> ERROR #1:
>> 2013-01-12 00:17:17 UTC::@:[61150]: WARNING: page 2387476480 of relation
>> base/16408/18967399 does not exist
>> 2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT: xlog redo delete: index
>> 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
>> 2013-01-12 00:17:17 UTC::@:[61150]: PANIC: WAL contains references to
>> invalid pages
>> 2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT: xlog redo delete: index
>> 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
>> 2013-01-12 00:17:18 UTC::@:[61121]: LOG: startup process (PID 61150) was
>> terminated by signal 6: Aborted
>>
>> ERROR #2:
>> 2013-01-12 01:02:17 UTC::@:[36831]: WARNING: page 2387607552 of relation
>> base/16408/18967399 does not exist
>> 2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT: xlog redo delete: index
>> 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
>> 2013-01-12 01:02:17 UTC::@:[36831]: PANIC: WAL contains references to
>> invalid pages
>> 2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT: xlog redo delete: index
>> 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
>> 2013-01-12 01:02:20 UTC::@:[36796]: LOG: startup process (PID 36831) was
>> terminated by signal 6: Aborted
>>
>>
>> Fortunately, these errors only pertain to indexes, which can be rebuilt.
>> But is this a sign that the data directory on the slave is corrupt?
>>
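>> (A sketch of how we'd identify and rebuild them, run on the master since
>> the standby is read-only; this assumes the relfilenodes from the errors
>> still map to live indexes in the database with OID 16408, and the index
>> name below is a placeholder:)
>>
>>     -- connect to the database whose OID is 16408, then:
>>     SELECT relname, relkind FROM pg_class
>>      WHERE relfilenode IN (22892841, 22892842);
>>     -- for each row with relkind = 'i':
>>     REINDEX INDEX some_index_name;
>>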
>> Here is a full timeline of the logs highlighting the issues:
>>
>> 1. Data Directory Copy Finishes.
>> 2. Recovery.conf Setup
>>
>> 3. Start #1:
>>
>> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: database system was shut down in
>> recovery at 2013-01-12 00:14:06 UTC
>> 2013-01-12 00:14:23 UTC:[local]:[unknown](at)[unknown]:[61151]: LOG:
>> incomplete startup packet
>> 2013-01-12 00:14:23 UTC:[local]:postgres(at)postgres:[61152]: FATAL: the
>> database system is starting up
>> unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
>> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: entering standby mode
>> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: restored log file
>> "0000000900008E45000000B8" from archive
>> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: restored log file
>> "0000000900008E450000008B" from archive
>> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: could not open directory
>> "pg_snapshots": No such file or directory
>> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: redo starts at 8E45/8B173180
>> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: consistent recovery state
>> reached at 8E45/8B174840
>> 2013-01-12 00:14:23 UTC:[local]:postgres(at)postgres:[61175]: FATAL: the
>> database system is starting up
>> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: restored log file
>> "0000000900008E450000008C" from archive
>> 2013-01-12 00:14:24 UTC::@:[61150]: LOG: restored log file
>> "0000000900008E450000008D" from archive
>>
>> *SNIP*
>>
>> 2013-01-12 00:17:17 UTC::@:[61150]: LOG: restored log file
>> "0000000900008E4800000066" from archive
>> 2013-01-12 00:17:17 UTC::@:[61150]: LOG: restored log file
>> "0000000900008E4800000067" from archive
>> 2013-01-12 00:17:17 UTC::@:[61150]: WARNING: page 2387476480 of relation
>> base/16408/18967399 does not exist
>> 2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT: xlog redo delete: index
>> 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
>> 2013-01-12 00:17:17 UTC::@:[61150]: PANIC: WAL contains references to
>> invalid pages
>> 2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT: xlog redo delete: index
>> 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
>> 2013-01-12 00:17:18 UTC::@:[61121]: LOG: startup process (PID 61150) was
>> terminated by signal 6: Aborted
>> 2013-01-12 00:17:18 UTC::@:[61121]: LOG: terminating any other active
>> server processes
>>
>>
>> 4. PostgreSQL shuts down...
>> 5. Debugging logs enabled in postgresql.conf.
>>
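>> (Concretely, a sketch of the postgresql.conf change; the exact level is
>> an assumption inferred from the DEBUG lines that follow:)
>>
>>     log_min_messages = debug2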
>>
>> 6. Start #2:
>>
>> 2013-01-12 00:51:48 UTC::@:[36831]: LOG: database system was interrupted
>> while in recovery at log time 2013-01-11 18:05:31 UTC
>> 2013-01-12 00:51:48 UTC::@:[36831]: HINT: If this has occurred more than
>> once some data might be corrupted and you might need to choose an earlier
>> recovery target.
>> 2013-01-12 00:51:48 UTC:[local]:[unknown](at)[unknown]:[36832]: LOG:
>> incomplete startup packet
>> 2013-01-12 00:51:48 UTC:[local]:postgres(at)postgres:[36833]: FATAL: the
>> database system is starting up
>> unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
>> 2013-01-12 00:51:48 UTC::@:[36831]: LOG: entering standby mode
>> 2013-01-12 00:51:48 UTC::@:[36831]: LOG: restored log file
>> "0000000900008E45000000B8" from archive
>> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: got WAL segment from archive
>> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: checkpoint record is at
>> 8E45/B80AF650
>> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: redo record is at
>> 8E45/8B173180; shutdown FALSE
>> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: next transaction ID:
>> 0/552803703; next OID: 24427698
>> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: next MultiXactId: 109027; next
>> MultiXactOffset: 2442921
>> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: oldest unfrozen transaction
>> ID: 3104202601, in database 16408
>> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: transaction ID wrap limit is
>> 956718952, limited by database with OID 16408
>> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: resetting unlogged relations:
>> cleanup 1 init 0
>> 2013-01-12 00:51:48 UTC::@:[36831]: LOG: could not open directory
>> "pg_snapshots": No such file or directory
>> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: initializing for hot standby
>> 2013-01-12 00:51:48 UTC::@:[36831]: LOG: restored log file
>> "0000000900008E450000008B" from archive
>> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: got WAL segment from archive
>> 2013-01-12 00:51:48 UTC::@:[36831]: LOG: redo starts at 8E45/8B173180
>> 2013-01-12 00:51:49 UTC::@:[36831]: LOG: restored log file
>> "0000000900008E450000008C" from archive
>>
>> *SNIP*
>>
>> 2013-01-12 00:54:45 UTC::@:[36831]: LOG: restored log file
>> "0000000900008E4800000062" from archive
>> 2013-01-12 00:54:45 UTC::@:[36831]: DEBUG: got WAL segment from archive
>> 2013-01-12 00:54:45 UTC::@:[36831]: LOG: restored log file
>> "0000000900008E4800000063" from archive
>> 2013-01-12 00:54:45 UTC::@:[36831]: DEBUG: got WAL segment from archive
>> 2013-01-12 00:54:45 UTC::@:[36831]: LOG: restored log file
>> "0000000900008E4800000064" from archive
>> 2013-01-12 00:54:45 UTC::@:[36831]: DEBUG: got WAL segment from archive
>> 2013-01-12 00:54:46 UTC::@:[36831]: LOG: restored log file
>> "0000000900008E4800000065" from archive
>> 2013-01-12 00:54:46 UTC::@:[36831]: DEBUG: got WAL segment from archive
>> 2013-01-12 00:54:46 UTC::@:[36831]: LOG: restored log file
>> "0000000900008E4800000066" from archive
>> 2013-01-12 00:54:46 UTC::@:[36831]: DEBUG: got WAL segment from archive
>> 2013-01-12 00:54:46 UTC::@:[36831]: LOG: restored log file
>> "0000000900008E4800000067" from archive
>> 2013-01-12 00:54:46 UTC::@:[36831]: DEBUG: got WAL segment from archive
>> 2013-01-12 00:54:46 UTC::@:[36831]: LOG: consistent recovery state
>> reached at 8E48/67AC4E28
>> 2013-01-12 00:54:46 UTC::@:[36796]: LOG: database system is ready to
>> accept read only connections
>> 2013-01-12 00:54:46 UTC::@:[36831]: LOG: restored log file
>> "0000000900008E4800000068" from archive
>> 2013-01-12 00:54:46 UTC::@:[36831]: DEBUG: got WAL segment from archive
>> 2013-01-12 00:54:47 UTC::@:[36831]: LOG: restored log file
>> "0000000900008E4800000069" from archive
>> 2013-01-12 00:54:47 UTC::@:[36831]: DEBUG: got WAL segment from archive
>> 2013-01-12 00:54:47 UTC::@:[36831]: LOG: restored log file
>> "0000000900008E480000006A" from archive
>> 2013-01-12 00:54:47 UTC::@:[36831]: DEBUG: got WAL segment from archive
>> 2013-01-12 00:54:47 UTC::@:[36831]: LOG: restored log file
>> "0000000900008E480000006B" from archive
>> 2013-01-12 00:54:47 UTC::@:[36831]: DEBUG: got WAL segment from archive
>> 2013-01-12 00:54:47 UTC::@:[36831]: LOG: restored log file
>> "0000000900008E480000006C" from archive
>>
>> *SNIP*
>>
>> 2013-01-12 01:02:16 UTC::@:[36831]: LOG: restored log file
>> "0000000900008E4F00000079" from archive
>> 2013-01-12 01:02:16 UTC::@:[36831]: DEBUG: got WAL segment from archive
>> 2013-01-12 01:02:17 UTC::@:[36831]: LOG: restored log file
>> "0000000900008E4F0000007A" from archive
>> 2013-01-12 01:02:17 UTC::@:[36831]: DEBUG: got WAL segment from archive
>> 2013-01-12 01:02:17 UTC::@:[36831]: WARNING: page 2387607552 of relation
>> base/16408/18967399 does not exist
>> 2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT: xlog redo delete: index
>> 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
>> 2013-01-12 01:02:17 UTC::@:[36831]: PANIC: WAL contains references to
>> invalid pages
>> 2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT: xlog redo delete: index
>> 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
>> 2013-01-12 01:02:20 UTC::@:[36796]: LOG: startup process (PID 36831) was
>> terminated by signal 6: Aborted
>> 2013-01-12 01:02:20 UTC::@:[36796]: LOG: terminating any other active
>> server processes
>>
>>
>> 7. PostgreSQL shuts down...
>>
>>
>> 8. Start #3:
>>
>> 2013-01-12 01:04:04 UTC::@:[61742]: LOG: database system was interrupted
>> while in recovery at log time 2013-01-11 19:50:31 UTC
>> 2013-01-12 01:04:04 UTC::@:[61742]: HINT: If this has occurred more than
>> once some data might be corrupted and you might need to choose an earlier
>> recovery target.
>> 2013-01-12 01:04:04 UTC:[local]:[unknown](at)[unknown]:[61743]: LOG:
>> incomplete startup packet
>> 2013-01-12 01:04:04 UTC:[local]:postgres(at)postgres:[61744]: FATAL: the
>> database system is starting up
>> unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
>> 2013-01-12 01:04:04 UTC::@:[61742]: LOG: entering standby mode
>> 2013-01-12 01:04:04 UTC::@:[61742]: LOG: restored log file
>> "0000000900008E4A00000039" from archive
>> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: got WAL segment from archive
>> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: checkpoint record is at
>> 8E4A/39CD4BA0
>> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: redo record is at
>> 8E4A/19F0D210; shutdown FALSE
>> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: next transaction ID:
>> 0/552859005; next OID: 24427698
>> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: next MultiXactId: 109221; next
>> MultiXactOffset: 2443321
>> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: oldest unfrozen transaction
>> ID: 3104202601, in database 16408
>> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: transaction ID wrap limit is
>> 956718952, limited by database with OID 16408
>> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: resetting unlogged relations:
>> cleanup 1 init 0
>> 2013-01-12 01:04:04 UTC::@:[61742]: LOG: could not open directory
>> "pg_snapshots": No such file or directory
>> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: initializing for hot standby
>> 2013-01-12 01:04:05 UTC::@:[61742]: LOG: restored log file
>> "0000000900008E4A00000019" from archive
>> 2013-01-12 01:04:05 UTC::@:[61742]: DEBUG: got WAL segment from archive
>> 2013-01-12 01:04:05 UTC::@:[61742]: LOG: redo starts at 8E4A/19F0D210
>> 2013-01-12 01:04:05 UTC::@:[61742]: LOG: restored log file
>> "0000000900008E4A0000001A" from archive
>> 2013-01-12 01:04:05 UTC::@:[61742]: DEBUG: got WAL segment from archive
>>
>> *SNIP*
>>
>> 2013-01-12 01:09:37 UTC::@:[61742]: LOG: restored log file
>> "0000000900008E4F00000077" from archive
>> 2013-01-12 01:09:37 UTC::@:[61742]: DEBUG: got WAL segment from archive
>> 2013-01-12 01:09:37 UTC::@:[61742]: LOG: restored log file
>> "0000000900008E4F00000078" from archive
>> 2013-01-12 01:09:37 UTC::@:[61742]: DEBUG: got WAL segment from archive
>> 2013-01-12 01:09:37 UTC::@:[61742]: LOG: restored log file
>> "0000000900008E4F00000079" from archive
>> 2013-01-12 01:09:37 UTC::@:[61742]: DEBUG: got WAL segment from archive
>> 2013-01-12 01:09:38 UTC::@:[61742]: LOG: restored log file
>> "0000000900008E4F0000007A" from archive
>> 2013-01-12 01:09:38 UTC::@:[61742]: DEBUG: got WAL segment from archive
>> 2013-01-12 01:09:38 UTC::@:[61742]: LOG: consistent recovery state
>> reached at 8E4F/7A22BD08
>> 2013-01-12 01:09:38 UTC::@:[61729]: LOG: database system is ready to
>> accept read only connections
>> 2013-01-12 01:09:38 UTC::@:[61742]: LOG: restored log file
>> "0000000900008E4F0000007B" from archive
>> 2013-01-12 01:09:38 UTC::@:[61742]: DEBUG: got WAL segment from archive
>> 2013-01-12 01:09:38 UTC::@:[61742]: LOG: restored log file
>> "0000000900008E4F0000007C" from archive
>> 2013-01-12 01:09:38 UTC::@:[61742]: DEBUG: got WAL segment from archive
>> 2013-01-12 01:09:38 UTC::@:[61742]: LOG: restored log file
>> "0000000900008E4F0000007D" from archive
>> 2013-01-12 01:09:38 UTC::@:[61742]: DEBUG: got WAL segment from archive
>> 2013-01-12 01:09:39 UTC::@:[61742]: LOG: restored log file
>> "0000000900008E4F0000007E" from archive
>>
>>
>> *SNIP*
>>
>> 2013-01-12 01:14:29 UTC::@:[61742]: LOG: restored log file
>> "0000000900008E53000000D9" from archive
>> 2013-01-12 01:14:29 UTC::@:[61742]: DEBUG: got WAL segment from archive
>> 2013-01-12 01:14:29 UTC::@:[61742]: LOG: restored log file
>> "0000000900008E53000000DA" from archive
>> 2013-01-12 01:14:29 UTC::@:[61742]: DEBUG: got WAL segment from archive
>> 2013-01-12 01:14:29 UTC::@:[61742]: LOG: restored log file
>> "0000000900008E53000000DB" from archive
>> 2013-01-12 01:14:29 UTC::@:[61742]: DEBUG: got WAL segment from archive
>> 2013-01-12 01:14:30 UTC::@:[61742]: LOG: restored log file
>> "0000000900008E53000000DC" from archive
>> 2013-01-12 01:14:30 UTC::@:[61742]: DEBUG: got WAL segment from archive
>> 2013-01-12 01:14:30 UTC::@:[61742]: LOG: restored log file
>> "0000000900008E53000000DD" from archive
>> 2013-01-12 01:14:30 UTC::@:[61742]: DEBUG: got WAL segment from archive
>> unpigz: /mnt/db/wals/0000000900008E53000000DE does not exist -- skipping
>> 2013-01-12 01:14:30 UTC::@:[61742]: LOG: unexpected pageaddr 8E4A/1000000
>> in log file 36435, segment 222, offset 0
>> unpigz: /mnt/db/wals/0000000900008E53000000DE does not exist -- skipping
>> 2013-01-12 01:14:30 UTC::@:[19504]: LOG: streaming replication
>> successfully connected to primary
>> 2013-01-12 01:18:08 UTC::@:[61772]: DEBUG: checkpoint sync: number=1
>> file=base/16408/22873432 time=2.538 msec
>> 2013-01-12 01:18:08 UTC::@:[61772]: DEBUG: checkpoint sync: number=2
>> file=base/16408/18967506 time=12.054 msec
>> 2013-01-12 01:18:08 UTC::@:[61772]: DEBUG: checkpoint sync: number=3
>> file=base/16408/18967506_fsm time=0.095 msec
>> 2013-01-12 01:18:08 UTC::@:[61772]: DEBUG: checkpoint sync: number=4
>> file=base/16408/22873244 time=0.144 msec
>> 2013-01-12 01:18:08 UTC::@:[61772]: DEBUG: checkpoint sync: number=5
>> file=base/16408/22892823 time=0.087 msec
>>
>> 9. Slave DB connected via streaming replication to the Master DB, and all
>> seems fine.
>>
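>> (To double-check that the standby is keeping up, a quick look on the
>> master; pg_stat_replication is available in 9.2:)
>>
>>     SELECT client_addr, state, sent_location, replay_location
>>       FROM pg_stat_replication;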
>>
>> Any help would be appreciated. Thanks!
>
>
--
NTT OSS Center
Mitsumasa KONDO