Corruption Debug Help.

From: Matthew Sellers <matt(at)indigo(dot)nu>
To: pgsql-admin(at)postgresql(dot)org
Subject: Corruption Debug Help.
Date: 2011-10-31 16:34:06
Message-ID: CACMbGu3Xx6P2TwjUuG0-tzTtiU2MV31jbqmsZ=YywSoiX4jXvg@mail.gmail.com
Lists: pgsql-admin

Hi All,

I believe I may have hit a Postgres bug and am eager for a bit of
feedback. It seems we may have had some type of catalog corruption,
as outlined in the events pasted below. I am including our
observations of the problem, and am asking the list whether I can
perform any further diagnostics or root cause analysis.

# During normal database operations I received this error while a cron
job issued a COPY on a temporary table. Further SELECTs on this table
yielded the same error. This table is toast :-)

2011-10-26 20:32:25.603 CDT helios 172.20.45.57(34663)ERROR: could
not read block 355 in file "base/16421/286173855": read only 0 of 8192
bytes
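
As a rough sanity check on the error above, the arithmetic below shows
where block 355 would sit inside the relation file. It assumes the
default 8192-byte block size (which matches the "8192 bytes" in the
message); the helper names are made up for illustration. A read that
returns 0 bytes at that position suggests the file ends at or before
byte 2908160, i.e. it holds at most 355 blocks (numbered 0 to 354).

```python
BLCKSZ = 8192  # assumed default block size; matches "8192 bytes" in the error


def block_byte_range(block_no, block_size=BLCKSZ):
    """Return (start, end) byte offsets of a block; end is exclusive."""
    start = block_no * block_size
    return start, start + block_size


def blocks_in_file(file_size, block_size=BLCKSZ):
    """Number of complete blocks that fit in a file of file_size bytes."""
    return file_size // block_size


start, end = block_byte_range(355)
print(f"block 355 would span bytes {start}..{end - 1}")
```

Comparing the actual size of base/16421/286173855 on disk against these
numbers (for example by passing its size to blocks_in_file) would show
how many blocks are really present.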

# Next we attempted to configure a hot standby server to replicate and
test possible corruption issues. After rsyncing $PG_HOME and starting
up the read-only slave, I received these errors. The file 'global/11595'
does not exist on the slave or the master, further supporting the
theory of data corruption.

2011-10-31 09:31:03.682 CDT LOG: streaming replication successfully
connected to primary
2011-10-31 09:31:04.976 CDT postgres [local]FATAL: could not open
file "global/11595": Permission denied
2011-10-31 09:31:21.981 CDT postgres [local]FATAL: could not access
status of transaction 65536
2011-10-31 09:31:21.981 CDT postgres [local]DETAIL: Could not read
from file "pg_clog/0000" at offset 16384: Success.
2011-10-31 10:55:48.800 CDT helios [local]FATAL: could not access
status of transaction 65536
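
The "offset 16384" in the pg_clog message above can be reproduced from
the transaction ID. The sketch below assumes the commit log layout as I
understand it from the PostgreSQL source for this era: 2 status bits per
transaction (4 per byte), 8192-byte pages, and 32 pages per segment
file. Treat the constants and the function name as assumptions, not as
an authoritative description.

```python
BLCKSZ = 8192                  # assumed default page size
CLOG_XACTS_PER_BYTE = 4        # assumed: 2 status bits per transaction
CLOG_XACTS_PER_PAGE = BLCKSZ * CLOG_XACTS_PER_BYTE
SLRU_PAGES_PER_SEGMENT = 32    # assumed: pages per pg_clog segment file


def clog_location(xid):
    """Map a transaction ID to (segment file, page offset, byte within page)."""
    page = xid // CLOG_XACTS_PER_PAGE
    segment = page // SLRU_PAGES_PER_SEGMENT
    page_offset = (page % SLRU_PAGES_PER_SEGMENT) * BLCKSZ
    byte_in_page = (xid % CLOG_XACTS_PER_PAGE) // CLOG_XACTS_PER_BYTE
    return f"pg_clog/{segment:04X}", page_offset, byte_in_page


print(clog_location(65536))
```

Under these assumptions transaction 65536 maps to pg_clog/0000 at page
offset 16384, which agrees with the log. The trailing "Success" likely
means the read came back short without an OS error, so the copy of
pg_clog/0000 on the standby may simply be shorter than 24576 bytes;
checking its size on both machines would confirm or rule that out.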

As a final test we ran a pg_dump on the master, which completed
successfully, and are currently restoring the dump to another machine.
The restore has not yielded any errors but is far from complete given my
database size. I am running Postgres 9.0.4 on high-end hardware
(machine + SAN) and have no indication of hardware-related data loss,
so next I'm digging into understanding the inner workings of the
PostgreSQL on-disk format.

If anyone can suggest how to properly diagnose this type of issue it
would be greatly appreciated.

Thanks!
Matt
