From: | Michael Harris <michael(dot)harris(at)ericsson(dot)com> |
---|---|
To: | "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org> |
Subject: | Hot Standby has PANIC: WAL contains references to invalid pages |
Date: | 2013-02-05 00:35:24 |
Message-ID: | 30BC62DC16C7B842A8446ED8EB2F0439067D1C@ESGSCMB105.ericsson.se |
Lists: | pgsql-general |
Hi All,
We are having a thorny problem I'm hoping someone will be able to help with.
We have a pair of machines set up as an active / hot standby pair. The database they contain is quite large, approximately 9 TB. They were working fine on 9.1, and we recently upgraded the active DB to 9.2.1.
After upgrading the active DB, we re-mirrored the standby (using pg_basebackup) and started it up. It began replaying the WAL files as expected.
After a few hours this happened:
WARNING: page 1 of relation pg_tblspc/16408/PG_9.2_201204301/16409/1123460086 is uninitialized
CONTEXT: xlog redo vacuum: rel 16408/16409/1123460086; blk 4411, lastBlockVacuumed 0
PANIC: WAL contains references to invalid pages
CONTEXT: xlog redo vacuum: rel 16408/16409/1123460086; blk 4411, lastBlockVacuumed 0
LOG: startup process (PID 24195) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
We tried starting it up again, and the same thing happened.
After some googling and re-reading the release notes, we noticed the mention in the 9.2.1 release notes of the potential for corrupted visibility maps. As per the recommendation there, we did a full VACUUM of the whole database (with vacuum_freeze_table_age set to zero), then re-mirrored the standby again.
After re-mirroring was completed we started the standby again. Strangely, it reached consistency after only 33 WAL files; since the base backup took 5 days to complete, this does not seem right to me. Anyway, WAL recovery continued, with occasional warnings like this:
[2013-02-04 10:30:51 EST] 13546@ WARNING: xlog min recovery request 1A13A/9BC425A0 is past current point 19F1E/725043E8
[2013-02-04 10:30:51 EST] 13546@ CONTEXT: writing block 0 of relation pg_tblspc/16408/PG_9.2_201204301/16409/12525_vm
After a few hours, this happened:
[2013-02-04 13:43:24 EST] 13538@ WARNING: page 1248 of relation pg_tblspc/16408/PG_9.2_201204301/16409/1128746393 does not exist
[2013-02-04 13:43:24 EST] 13538@ CONTEXT: xlog redo visible: rel 16408/16409/1128746393; blk 1248
[2013-02-04 13:43:24 EST] 13538@ PANIC: WAL contains references to invalid pages
[2013-02-04 13:43:24 EST] 13538@ CONTEXT: xlog redo visible: rel 16408/16409/1128746393; blk 1248
[2013-02-04 13:43:25 EST] 13532@ LOG: startup process (PID 13538) was terminated by signal 6: Aborted
[2013-02-04 13:43:25 EST] 13532@ LOG: terminating any other active server processes
This looks similar to the first case, but with a different context. We thought that perhaps an index had become corrupted (apparently also a possibility with the bug mentioned above); however, the file mentioned belongs to a normal table, not an index. And 'redo visible' sounds like it might have to do with the visibility map?
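For illustration only (a minimal sketch, not the exact steps taken here): the `rel` triple in the CONTEXT line is tablespace OID / database OID / relfilenode, which matches the `pg_tblspc/16408/PG_9.2_201204301/16409/...` path in the WARNING. The relfilenode can then be looked up in `pg_class` to see whether the file belongs to a table (`relkind = 'r'`) or an index (`relkind = 'i'`). The helper names `parse_redo_context` and `lookup_sql` are hypothetical, and the generated query is assumed to be run while connected to the database whose OID appears in the triple.

```python
import re

# Matches the "rel <tablespace>/<database>/<relfilenode>; blk <n>" part
# of an xlog redo CONTEXT line as it appears in the logs above.
REDO_REL = re.compile(r"rel (\d+)/(\d+)/(\d+); blk (\d+)")


def parse_redo_context(line):
    """Split a redo CONTEXT line into its numeric parts (hypothetical helper)."""
    m = REDO_REL.search(line)
    if m is None:
        return None
    spc, db, relnode, blk = (int(g) for g in m.groups())
    return {
        "tablespace_oid": spc,
        "database_oid": db,
        "relfilenode": relnode,
        "block": blk,
    }


def lookup_sql(relfilenode):
    """Build a pg_class query for the relation (hypothetical helper).

    relkind 'r' is an ordinary table, 'i' is an index.
    """
    return (
        "SELECT relname, relkind FROM pg_class "
        "WHERE relfilenode = %d;" % relfilenode
    )


line = "CONTEXT: xlog redo visible: rel 16408/16409/1128746393; blk 1248"
info = parse_redo_context(line)
print(info)
print(lookup_sql(info["relfilenode"]))
```

The same parsing works for the earlier `xlog redo vacuum` line, since the regular expression ignores anything after the block number.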
We restarted it again with debugging cranked up. It didn't reveal anything more interesting. We then upgraded the standby to 9.2.2 and started it again. Again no dice. In each case it fails at exactly the same point with the same error.
Any ideas for a next troubleshooting step?
Regards // Mike