Detecting corruption

From: Anthony Nowocien <anowocien(at)gmail(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Detecting corruption
Date: 2014-12-09 07:55:14
Message-ID: CAH5RRoM1MzAeDMersv92U-=aOQaJUf61v6vpr+ZpXCZG5eps-A@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi everyone,

articles from the wiki about corruption, Christophe Petus' talk in PgCon
Eu2014 were of great interest but I'm still slightly at a loss in an issue
I've faced recently.

Please assume the following:
- Version 9.0.18 on Debian boxes;
- Master/slave integrated replication and a back and forth failover was
executed;
- Fine recent hardware;
- Backups, restores all working fine and no corruption that I could detect;
- fsync was never touched, no power loss, no funny issue in
/var/log/syslog, no recent file system check

I have found two issues that must have been caused by our custom failover
procedure, as both instances gave similar results for the same objects

1) Corrupted index
*select a from t1 where a = 10;* gave the rather puzzling
a
1
2
10

For every other value I could test, I had correct results. The previous
query was using an index, and rebuilding it finally gave me expected
results. Data corruption on indexes is annoying, even more as it can give
rather puzzling results, but rebuilding is enough in order to recover.

2) Corrupted table
*select b from t2* worked fine on this rather write intensive table, with a
being a timestamp.
*select max(b) from t2* gave me the unfortunate error "Could not read Block
X of relation base/Y/Z". In retrospect, this might have also been an index
issue and I still have some analyze to do to pinpoint it.

I know that PostgreSQL is very trusting regarding the underlying file
system, but I would also like to diagnose those corruption issues as early
as possible. Only very specific queries gave errors. As said above, pg_dump
(database wise or table wise) worked fine, restores went fine and the main
log did not contain any glaring errors before the unfortunate "Could not
read Block X". I thought that pg_dump, reading every single row in the
database, would help in finding corruption... But such was not the case.

3rd party projects exists, 9.3 allows checksums (but at a rather hefty
price it seems - how usable is it? ), but is there any way I could have
detected this corruption issue with 9.0? Do you also have advice on such
(semi) silent corruption?

Thanks,

Anthony

Browse pgsql-general by date

  From Date Subject
Next Message Albe Laurenz 2014-12-09 10:24:47 Re: Use cases for lateral that do not involve a set returning function
Previous Message David Johnston 2014-12-09 05:47:46 Re: FW: SQL rolling window without aggregation