corrupted item pointer in streaming based replication

From: Jigar Shah <jshah(at)pandora(dot)com>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: corrupted item pointer in streaming based replication
Date: 2013-04-03 20:02:49
Message-ID: 1E737D138B89104D8A7853F7DD23177DB6320F@SF1-EXMBX-2.ad.savagebeast.com
Lists: pgsql-general

Hi,

Postgres version = 9.1.2
OS = debian(6.0.7)
fsync = on
full_page_writes = on
Setup = Primary and streaming replication based secondary

A few days ago our primary started to throw the error messages below, indicating corruption in the database. It crashed several times and logged a PANIC message:

2013-03-25 07:30:39.545 PDT PANIC: corrupted item pointer: offset = 0, size = 0
2013-03-25 07:30:39.704 PDT LOG: server process (PID 8715) was terminated by signal 6: Aborted
2013-03-25 07:30:39.704 PDT LOG: terminating any other active server processes

In the days before it started crashing, it logged the following errors:

[d: u:postgres p:2498 7] ERROR: could not access status of transaction 837550133
DETAIL: Could not open file "pg_clog/031E": No such file or directory. [u:postgres p:2498 9]
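
For reference, the file name in this error can be derived from the transaction ID by simple arithmetic. The sketch below assumes the default 8 kB block size (4 transaction statuses per byte, so 32768 transactions per page) and 32 pages per pg_clog segment; the function name is illustrative only.

```python
# Assumptions: default BLCKSZ of 8192 bytes, 4 transaction statuses per byte,
# and 32 pages per pg_clog segment file.
BLCKSZ = 8192
XACTS_PER_BYTE = 4
PAGES_PER_SEGMENT = 32
XACTS_PER_SEGMENT = BLCKSZ * XACTS_PER_BYTE * PAGES_PER_SEGMENT  # 1048576


def clog_segment_name(xid: int) -> str:
    """Return the pg_clog segment file name that would hold the status of xid."""
    return format(xid // XACTS_PER_SEGMENT, "04X")


print(clog_segment_name(837550133))  # 031E
```

Under those assumptions the result matches the file named in the error, so the lookup is internally consistent and the segment file itself is what is missing on disk.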

[d: u:radio p:31917 242] ERROR: could not open file "base/16384/114846.39" (target block 360448000): No such file or directory [d: u:radio p:31917 243]
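
The target block number in this error can be checked against the segment suffix in the same way. This is a minimal sketch assuming default 8 kB blocks and 1 GB relation segment files; the function name is hypothetical.

```python
# Assumptions: default 8 kB blocks and 1 GB relation segments,
# i.e. 131072 blocks per segment file.
BLCKSZ = 8192
SEGMENT_BYTES = 1024 ** 3
BLOCKS_PER_SEGMENT = SEGMENT_BYTES // BLCKSZ  # 131072


def segment_for_block(block: int) -> int:
    """Return the segment file number that would contain the given block."""
    return block // BLOCKS_PER_SEGMENT


print(segment_for_block(360448000))  # 2750
```

With those defaults the target block would live in segment 2750, while the server failed on segment 39. One possible reading is that the relation only has segments up to .38 and the requested block is far past its end, which would suggest a bad block reference rather than a deleted file.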

On top of that, our secondaries have now crashed and will not start up. They show the error messages below in the PostgreSQL logs:

2013-03-27 11:00:47.281 PDT LOG: recovery restart point at 161A/17108AA8
2013-03-27 11:00:47.281 PDT DETAIL: last completed transaction was at log time 2013-03-27 11:00:47.241236-07
2013-03-27 11:00:47.520 PDT LOG: restartpoint starting: xlog

2013-03-27 11:07:51.348 PDT FATAL: corrupted item pointer: offset = 0, size = 0
2013-03-27 11:07:51.348 PDT CONTEXT: xlog redo split_l: rel 1663/16384/115085 left 4256959, right 5861610, next 5044459, level 0, firstright 192
2013-03-27 11:07:51.716 PDT LOG: startup process (PID 5959) exited with exit code 1
2013-03-27 11:07:51.716 PDT LOG: terminating any other active server processes
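
The CONTEXT line above identifies the relation being replayed. As a rough sketch, it can be pulled apart as follows; the "rel" triple is tablespace OID / database OID / relfilenode, and "split_l" appears to be a B-tree page split record (level 0 being a leaf page). The regular expression and function name are illustrative assumptions, not part of any PostgreSQL tool.

```python
import re

# Context line copied from the secondary's log.
CONTEXT = ("xlog redo split_l: rel 1663/16384/115085 left 4256959, right 5861610, "
           "next 5044459, level 0, firstright 192")

# Hypothetical pattern for this one message shape only.
PATTERN = re.compile(
    r"xlog redo (?P<op>\w+): rel (?P<spc>\d+)/(?P<db>\d+)/(?P<rel>\d+) "
    r"left (?P<left>\d+), right (?P<right>\d+), next (?P<next>\d+), "
    r"level (?P<level>\d+), firstright (?P<firstright>\d+)"
)


def parse_split_context(line: str) -> dict:
    """Extract the operation name and numeric fields from a split redo context line."""
    m = PATTERN.search(line)
    if m is None:
        raise ValueError("not a split redo context line")
    fields = m.groupdict()
    op = fields.pop("op")
    parsed = {key: int(value) for key, value in fields.items()}
    parsed["op"] = op
    return parsed


info = parse_split_context(CONTEXT)
print(info["rel"], info["level"])  # 115085 0
```

The relfilenode (115085 here) can then be matched against pg_class.relfilenode in the affected database to find out which relation the failing record belongs to.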

At this point we have a running but corrupt primary and a crashed secondary that won't start up.

What are our options at this point? Can we do something to fix this? How can we recover from the corruption?

Thanks in advance for any help.

Regards
Jigar
