corrupted item pointer in streaming based replication

From:	Jigar Shah <jshah(at)pandora(dot)com>
To:	"pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject:	corrupted item pointer in streaming based replication
Date:	2013-04-03 20:02:49
Message-ID:	1E737D138B89104D8A7853F7DD23177DB6320F@SF1-EXMBX-2.ad.savagebeast.com
Lists:	pgsql-general

Hi,

Postgres version = 9.1.2
OS = debian(6.0.7)
fsync = on
full_page_writes = on
Setup = Primary and streaming replication based secondary

A few days ago our primary started to throw the error messages below, indicating corruption in the database. It crashed several times and logged a PANIC message:

2013-03-25 07:30:39.545 PDT PANIC: corrupted item pointer: offset = 0, size = 0
2013-03-25 07:30:39.704 PDT LOG: server process (PID 8715) was terminated by signal 6: Aborted
2013-03-25 07:30:39.704 PDT LOG: terminating any other active server processes

In the days before it started to crash, it logged the error messages below.

[d: u:postgres p:2498 7] ERROR: could not access status of transaction 837550133
DETAIL: Could not open file "pg_clog/031E": No such file or directory. [u:postgres p:2498 9]
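
As a cross-check, the missing pg_clog file name can be derived from the transaction ID in the error. This is only a minimal sketch: the constants are the defaults of a stock build (8 kB blocks, 2 status bits per transaction, 32 pages per segment file) and are assumed here, not read from this server. The function name is made up for illustration.

```python
# Assumed build defaults, not values confirmed for this installation.
BLCKSZ = 8192                # bytes per page
CLOG_XACTS_PER_BYTE = 4      # 2 status bits per transaction
SLRU_PAGES_PER_SEGMENT = 32  # pages per pg_clog segment file


def clog_segment_for_xid(xid: int) -> str:
    """Return the pg_clog segment file name that holds the status of xid."""
    xacts_per_page = BLCKSZ * CLOG_XACTS_PER_BYTE
    page = xid // xacts_per_page
    segment = page // SLRU_PAGES_PER_SEGMENT
    # Segment files are named with four uppercase hex digits.
    return f"{segment:04X}"


print(clog_segment_for_xid(837550133))  # 031E
```

Under those assumptions the file name in the DETAIL line is consistent with the transaction ID in the ERROR line, so the message points at a segment that really is absent rather than at a mangled file name.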

[d: u:radio p:31917 242] ERROR: could not open file "base/16384/114846.39" (target block 360448000): No such file or directory [d: u:radio p:31917 243]
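
The target block in this error can be checked the same way. A rough sketch, again assuming the default 8 kB block size and 1 GB relation segments (131072 blocks per segment file); the helper name is hypothetical.

```python
# Assumed build defaults, not values confirmed for this installation.
BLCKSZ = 8192         # bytes per block
RELSEG_SIZE = 131072  # blocks per 1 GB segment file


def segment_for_block(block: int) -> int:
    """Return the relation segment number that would contain this block."""
    return block // RELSEG_SIZE


target_block = 360448000
print(segment_for_block(target_block))  # 2750
```

With those defaults, block 360448000 would fall in segment 2750, which implies a relation of roughly 2750 GB, while the file reported missing is only segment 39. One possible reading, and it is only an inference, is that the block number came from a damaged pointer rather than from a page that ever existed.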

On top of that, our secondaries have now crashed and will not start up. They show the error messages below in the Postgres logs.

2013-03-27 11:00:47.281 PDT LOG: recovery restart point at 161A/17108AA8
2013-03-27 11:00:47.281 PDT DETAIL: last completed transaction was at log time 2013-03-27 11:00:47.241236-07
2013-03-27 11:00:47.520 PDT LOG: restartpoint starting: xlog

2013-03-27 11:07:51.348 PDT FATAL: corrupted item pointer: offset = 0, size = 0
2013-03-27 11:07:51.348 PDT CONTEXT: xlog redo split_l: rel 1663/16384/115085 left 4256959, right 5861610, next 5044459, level 0, firstright 192
2013-03-27 11:07:51.716 PDT LOG: startup process (PID 5959) exited with exit code 1
2013-03-27 11:07:51.716 PDT LOG: terminating any other active server processes

At this point we have a running but corrupt primary and a crashed secondary that won't start up.

What are our options at this point? Can we do something to fix this? How can we recover from the corruption?

Thanks in advance for your help.

Regards
Jigar

Responses

Re: corrupted item pointer in streaming based replication at 2013-04-03 20:06:13 from Lonni J Friedman
Re: corrupted item pointer in streaming based replication at 2013-04-03 20:18:52 from Tom Lane
