Re: Postgres WAL Recovery Fails... And Then Works...

From: Lonni J Friedman <netllama(at)gmail(dot)com>
To: Phil Monroe <phil(at)identified(dot)com>
Cc: pgsql-admin(at)postgresql(dot)org, Infrastructure <infrastructure(at)identified(dot)com>, Vladimir Giverts <vlad(at)identified(dot)com>, Tejas <tejas(at)identified(dot)com>, Sasha Kipervarg <sasha(at)identified(dot)com>
Subject: Re: Postgres WAL Recovery Fails... And Then Works...
Date: 2013-01-14 20:02:34
Message-ID: CAP=oouEW1bb2r4vgkn8zCuQvhYROahXsjhpcvyepuTKkKHWsbQ@mail.gmail.com
Lists: pgsql-admin

Your errors look somewhat similar to a problem I reported last week
(no replies thus far):
http://www.postgresql.org/message-id/CAP=oouE5niXgAO_34Q+FGq=tS6EsMHXXRRoByov3zn9pU4249w@mail.gmail.com

Except in my case, no number of restarts helped. You didn't say: were you
explicitly copying $PGDATA, or using some other mechanism to migrate the
data elsewhere?
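
For comparison, the usual way to take a safe low-level copy is to wrap it in
pg_start_backup()/pg_stop_backup() and pull the files over with rsync,
roughly like this (the host name and paths below are just placeholders, not
your setup):

    # on the master: mark the start of the base backup
    psql -U postgres -c "SELECT pg_start_backup('standby_rebuild');"

    # copy the data directory to the standby, skipping pg_xlog
    rsync -a --exclude=pg_xlog/ /var/lib/postgresql/data/ \
        standby:/var/lib/postgresql/data/

    # on the master: finish the backup so the remaining WAL gets archived
    psql -U postgres -c "SELECT pg_stop_backup();"

If the copy was taken outside a backup like that, inconsistent pages on the
slave wouldn't be surprising.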

Also, which version of postgres are you using?

On Fri, Jan 11, 2013 at 6:32 PM, Phil Monroe <phil(at)identified(dot)com> wrote:
>
> Hi Everyone,
>
> So we had to fail over and do a full base backup to get our slave database
> back online, and ran into an interesting scenario. After copying the data
> directory, setting up the recovery.conf, and starting the slave database,
> the database crashes while replaying xlogs. However, on trying to start the
> database again, it replays xlogs farther than it initially got, but
> ultimately fails out again. After starting the DB a third time, PostgreSQL
> replays even further and catches up to the master to start streaming
> replication. Is this common and/or acceptable?
>
> The error messages that cause PostgreSQL to stop replaying xlogs:
>
> ERROR #1:
> 2013-01-12 00:17:17 UTC::@:[61150]: WARNING: page 2387476480 of relation
> base/16408/18967399 does not exist
> 2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT: xlog redo delete: index
> 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
> 2013-01-12 00:17:17 UTC::@:[61150]: PANIC: WAL contains references to
> invalid pages
> 2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT: xlog redo delete: index
> 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
> 2013-01-12 00:17:18 UTC::@:[61121]: LOG: startup process (PID 61150) was
> terminated by signal 6: Aborted
>
> ERROR #2:
> 2013-01-12 01:02:17 UTC::@:[36831]: WARNING: page 2387607552 of relation
> base/16408/18967399 does not exist
> 2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT: xlog redo delete: index
> 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
> 2013-01-12 01:02:17 UTC::@:[36831]: PANIC: WAL contains references to
> invalid pages
> 2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT: xlog redo delete: index
> 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
> 2013-01-12 01:02:20 UTC::@:[36796]: LOG: startup process (PID 36831) was
> terminated by signal 6: Aborted
>
>
> Fortunately, these errors only pertain to indexes, which can be rebuilt.
> But is this a sign that the data directory on the slave is corrupt?
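>
> (If it does turn out to be index-only damage, presumably the fix is to map
> the relfilenodes from the errors back to index names in the database with
> OID 16408 and REINDEX those on the master, so the rebuilt files stream back
> to the slave -- something like the following, where "ourdb" and the index
> name are placeholders:
>
>   psql -d ourdb -c \
>     "SELECT relname, relkind FROM pg_class WHERE relfilenode IN (22892841, 22892842);"
>   psql -d ourdb -c "REINDEX INDEX the_index_name;"
> )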
>
> Here is a full timeline of the logs highlighting the issues:
>
> 1. Data Directory Copy Finishes.
> 2. Recovery.conf Setup (rough contents sketched below, after the timeline)
>
> 3. Start #1:
>
> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: database system was shut down in
> recovery at 2013-01-12 00:14:06 UTC
> 2013-01-12 00:14:23 UTC:[local]:[unknown](at)[unknown]:[61151]: LOG:
> incomplete startup packet
> 2013-01-12 00:14:23 UTC:[local]:postgres(at)postgres:[61152]: FATAL: the
> database system is starting up
> unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: entering standby mode
> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: restored log file
> "0000000900008E45000000B8" from archive
> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: restored log file
> "0000000900008E450000008B" from archive
> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: could not open directory
> "pg_snapshots": No such file or directory
> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: redo starts at 8E45/8B173180
> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: consistent recovery state
> reached at 8E45/8B174840
> 2013-01-12 00:14:23 UTC:[local]:postgres(at)postgres:[61175]: FATAL: the
> database system is starting up
> 2013-01-12 00:14:23 UTC::@:[61150]: LOG: restored log file
> "0000000900008E450000008C" from archive
> 2013-01-12 00:14:24 UTC::@:[61150]: LOG: restored log file
> "0000000900008E450000008D" from archive
>
> *SNIP*
>
> 2013-01-12 00:17:17 UTC::@:[61150]: LOG: restored log file
> "0000000900008E4800000066" from archive
> 2013-01-12 00:17:17 UTC::@:[61150]: LOG: restored log file
> "0000000900008E4800000067" from archive
> 2013-01-12 00:17:17 UTC::@:[61150]: WARNING: page 2387476480 of relation
> base/16408/18967399 does not exist
> 2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT: xlog redo delete: index
> 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
> 2013-01-12 00:17:17 UTC::@:[61150]: PANIC: WAL contains references to
> invalid pages
> 2013-01-12 00:17:17 UTC::@:[61150]: CONTEXT: xlog redo delete: index
> 1663/16408/22892842; iblk 658355, heap 1663/16408/18967399;
> 2013-01-12 00:17:18 UTC::@:[61121]: LOG: startup process (PID 61150) was
> terminated by signal 6: Aborted
> 2013-01-12 00:17:18 UTC::@:[61121]: LOG: terminating any other active
> server processes
>
>
> 4. PostgreSQL shuts down...
> 5. Debugging logs enabled in postgresql.conf.
>
>
> 6. Start #2:
>
> 2013-01-12 00:51:48 UTC::@:[36831]: LOG: database system was interrupted
> while in recovery at log time 2013-01-11 18:05:31 UTC
> 2013-01-12 00:51:48 UTC::@:[36831]: HINT: If this has occurred more than
> once some data might be corrupted and you might need to choose an earlier
> recovery target.
> 2013-01-12 00:51:48 UTC:[local]:[unknown](at)[unknown]:[36832]: LOG:
> incomplete startup packet
> 2013-01-12 00:51:48 UTC:[local]:postgres(at)postgres:[36833]: FATAL: the
> database system is starting up
> unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
> 2013-01-12 00:51:48 UTC::@:[36831]: LOG: entering standby mode
> 2013-01-12 00:51:48 UTC::@:[36831]: LOG: restored log file
> "0000000900008E45000000B8" from archive
> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: checkpoint record is at
> 8E45/B80AF650
> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: redo record is at
> 8E45/8B173180; shutdown FALSE
> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: next transaction ID:
> 0/552803703; next OID: 24427698
> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: next MultiXactId: 109027; next
> MultiXactOffset: 2442921
> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: oldest unfrozen transaction
> ID: 3104202601, in database 16408
> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: transaction ID wrap limit is
> 956718952, limited by database with OID 16408
> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: resetting unlogged relations:
> cleanup 1 init 0
> 2013-01-12 00:51:48 UTC::@:[36831]: LOG: could not open directory
> "pg_snapshots": No such file or directory
> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: initializing for hot standby
> 2013-01-12 00:51:48 UTC::@:[36831]: LOG: restored log file
> "0000000900008E450000008B" from archive
> 2013-01-12 00:51:48 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:51:48 UTC::@:[36831]: LOG: redo starts at 8E45/8B173180
> 2013-01-12 00:51:49 UTC::@:[36831]: LOG: restored log file
> "0000000900008E450000008C" from archive
>
> *SNIP*
>
> 2013-01-12 00:54:45 UTC::@:[36831]: LOG: restored log file
> "0000000900008E4800000062" from archive
> 2013-01-12 00:54:45 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:54:45 UTC::@:[36831]: LOG: restored log file
> "0000000900008E4800000063" from archive
> 2013-01-12 00:54:45 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:54:45 UTC::@:[36831]: LOG: restored log file
> "0000000900008E4800000064" from archive
> 2013-01-12 00:54:45 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:54:46 UTC::@:[36831]: LOG: restored log file
> "0000000900008E4800000065" from archive
> 2013-01-12 00:54:46 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:54:46 UTC::@:[36831]: LOG: restored log file
> "0000000900008E4800000066" from archive
> 2013-01-12 00:54:46 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:54:46 UTC::@:[36831]: LOG: restored log file
> "0000000900008E4800000067" from archive
> 2013-01-12 00:54:46 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:54:46 UTC::@:[36831]: LOG: consistent recovery state
> reached at 8E48/67AC4E28
> 2013-01-12 00:54:46 UTC::@:[36796]: LOG: database system is ready to
> accept read only connections
> 2013-01-12 00:54:46 UTC::@:[36831]: LOG: restored log file
> "0000000900008E4800000068" from archive
> 2013-01-12 00:54:46 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:54:47 UTC::@:[36831]: LOG: restored log file
> "0000000900008E4800000069" from archive
> 2013-01-12 00:54:47 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:54:47 UTC::@:[36831]: LOG: restored log file
> "0000000900008E480000006A" from archive
> 2013-01-12 00:54:47 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:54:47 UTC::@:[36831]: LOG: restored log file
> "0000000900008E480000006B" from archive
> 2013-01-12 00:54:47 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 00:54:47 UTC::@:[36831]: LOG: restored log file
> "0000000900008E480000006C" from archive
>
> *SNIP*
>
> 2013-01-12 01:02:16 UTC::@:[36831]: LOG: restored log file
> "0000000900008E4F00000079" from archive
> 2013-01-12 01:02:16 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 01:02:17 UTC::@:[36831]: LOG: restored log file
> "0000000900008E4F0000007A" from archive
> 2013-01-12 01:02:17 UTC::@:[36831]: DEBUG: got WAL segment from archive
> 2013-01-12 01:02:17 UTC::@:[36831]: WARNING: page 2387607552 of relation
> base/16408/18967399 does not exist
> 2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT: xlog redo delete: index
> 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
> 2013-01-12 01:02:17 UTC::@:[36831]: PANIC: WAL contains references to
> invalid pages
> 2013-01-12 01:02:17 UTC::@:[36831]: CONTEXT: xlog redo delete: index
> 1663/16408/22892841; iblk 1075350, heap 1663/16408/18967399;
> 2013-01-12 01:02:20 UTC::@:[36796]: LOG: startup process (PID 36831) was
> terminated by signal 6: Aborted
> 2013-01-12 01:02:20 UTC::@:[36796]: LOG: terminating any other active
> server processes
>
>
> 7. PostgreSQL shuts down...
>
>
> 8. Start #3:
>
> 2013-01-12 01:04:04 UTC::@:[61742]: LOG: database system was interrupted
> while in recovery at log time 2013-01-11 19:50:31 UTC
> 2013-01-12 01:04:04 UTC::@:[61742]: HINT: If this has occurred more than
> once some data might be corrupted and you might need to choose an earlier
> recovery target.
> 2013-01-12 01:04:04 UTC:[local]:[unknown](at)[unknown]:[61743]: LOG:
> incomplete startup packet
> 2013-01-12 01:04:04 UTC:[local]:postgres(at)postgres:[61744]: FATAL: the
> database system is starting up
> unpigz: /mnt/db/wals/00000009.history does not exist -- skipping
> 2013-01-12 01:04:04 UTC::@:[61742]: LOG: entering standby mode
> 2013-01-12 01:04:04 UTC::@:[61742]: LOG: restored log file
> "0000000900008E4A00000039" from archive
> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: checkpoint record is at
> 8E4A/39CD4BA0
> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: redo record is at
> 8E4A/19F0D210; shutdown FALSE
> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: next transaction ID:
> 0/552859005; next OID: 24427698
> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: next MultiXactId: 109221; next
> MultiXactOffset: 2443321
> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: oldest unfrozen transaction
> ID: 3104202601, in database 16408
> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: transaction ID wrap limit is
> 956718952, limited by database with OID 16408
> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: resetting unlogged relations:
> cleanup 1 init 0
> 2013-01-12 01:04:04 UTC::@:[61742]: LOG: could not open directory
> "pg_snapshots": No such file or directory
> 2013-01-12 01:04:04 UTC::@:[61742]: DEBUG: initializing for hot standby
> 2013-01-12 01:04:05 UTC::@:[61742]: LOG: restored log file
> "0000000900008E4A00000019" from archive
> 2013-01-12 01:04:05 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:04:05 UTC::@:[61742]: LOG: redo starts at 8E4A/19F0D210
> 2013-01-12 01:04:05 UTC::@:[61742]: LOG: restored log file
> "0000000900008E4A0000001A" from archive
> 2013-01-12 01:04:05 UTC::@:[61742]: DEBUG: got WAL segment from archive
>
> *SNIP*
>
> 2013-01-12 01:09:37 UTC::@:[61742]: LOG: restored log file
> "0000000900008E4F00000077" from archive
> 2013-01-12 01:09:37 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:09:37 UTC::@:[61742]: LOG: restored log file
> "0000000900008E4F00000078" from archive
> 2013-01-12 01:09:37 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:09:37 UTC::@:[61742]: LOG: restored log file
> "0000000900008E4F00000079" from archive
> 2013-01-12 01:09:37 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:09:38 UTC::@:[61742]: LOG: restored log file
> "0000000900008E4F0000007A" from archive
> 2013-01-12 01:09:38 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:09:38 UTC::@:[61742]: LOG: consistent recovery state
> reached at 8E4F/7A22BD08
> 2013-01-12 01:09:38 UTC::@:[61729]: LOG: database system is ready to
> accept read only connections
> 2013-01-12 01:09:38 UTC::@:[61742]: LOG: restored log file
> "0000000900008E4F0000007B" from archive
> 2013-01-12 01:09:38 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:09:38 UTC::@:[61742]: LOG: restored log file
> "0000000900008E4F0000007C" from archive
> 2013-01-12 01:09:38 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:09:38 UTC::@:[61742]: LOG: restored log file
> "0000000900008E4F0000007D" from archive
> 2013-01-12 01:09:38 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:09:39 UTC::@:[61742]: LOG: restored log file
> "0000000900008E4F0000007E" from archive
>
>
> *SNIP*
>
> 2013-01-12 01:14:29 UTC::@:[61742]: LOG: restored log file
> "0000000900008E53000000D9" from archive
> 2013-01-12 01:14:29 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:14:29 UTC::@:[61742]: LOG: restored log file
> "0000000900008E53000000DA" from archive
> 2013-01-12 01:14:29 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:14:29 UTC::@:[61742]: LOG: restored log file
> "0000000900008E53000000DB" from archive
> 2013-01-12 01:14:29 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:14:30 UTC::@:[61742]: LOG: restored log file
> "0000000900008E53000000DC" from archive
> 2013-01-12 01:14:30 UTC::@:[61742]: DEBUG: got WAL segment from archive
> 2013-01-12 01:14:30 UTC::@:[61742]: LOG: restored log file
> "0000000900008E53000000DD" from archive
> 2013-01-12 01:14:30 UTC::@:[61742]: DEBUG: got WAL segment from archive
> unpigz: /mnt/db/wals/0000000900008E53000000DE does not exist -- skipping
> 2013-01-12 01:14:30 UTC::@:[61742]: LOG: unexpected pageaddr 8E4A/1000000
> in log file 36435, segment 222, offset 0
> unpigz: /mnt/db/wals/0000000900008E53000000DE does not exist -- skipping
> 2013-01-12 01:14:30 UTC::@:[19504]: LOG: streaming replication
> successfully connected to primary
> 2013-01-12 01:18:08 UTC::@:[61772]: DEBUG: checkpoint sync: number=1
> file=base/16408/22873432 time=2.538 msec
> 2013-01-12 01:18:08 UTC::@:[61772]: DEBUG: checkpoint sync: number=2
> file=base/16408/18967506 time=12.054 msec
> 2013-01-12 01:18:08 UTC::@:[61772]: DEBUG: checkpoint sync: number=3
> file=base/16408/18967506_fsm time=0.095 msec
> 2013-01-12 01:18:08 UTC::@:[61772]: DEBUG: checkpoint sync: number=4
> file=base/16408/22873244 time=0.144 msec
> 2013-01-12 01:18:08 UTC::@:[61772]: DEBUG: checkpoint sync: number=5
> file=base/16408/22892823 time=0.087 msec
>
> 9. Slave DB connected to streaming replication with Master DB and all
> seems fine.
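>
> For reference, the recovery.conf from step 2 above is the standard standby
> setup, roughly this shape (the conninfo values are placeholders, and the
> real restore_command is an unpigz-based fetch from /mnt/db/wals rather than
> a plain cp):
>
>   standby_mode = 'on'
>   primary_conninfo = 'host=master-host port=5432 user=replication'
>   restore_command = 'cp /mnt/db/wals/%f %p'   # placeholder; actual command wraps unpigz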
>
>
> Any help would be appreciated. Thanks!
