Corruption with PITR restore

From: Jorge Torralba <jorge(dot)torralba(at)gmail(dot)com>
To: pgsql-admin(at)postgresql(dot)org
Subject: Corruption with PITR restore
Date: 2013-03-13 01:28:39
Message-ID: CACut7uR=k4zk1wXeSV=iFrLDsvZsuot__Bdh1hBwQt1QCW4a+A@mail.gmail.com
Lists: pgsql-admin

Experiencing serious issues with PITR restore.

We are using PostgreSQL 8.3.9 on CentOS 5.6. We have archiving turned on and
write to an NFS share with a very simple archive command:

"cp -i %p /path/to/nfsshare/%f </dev/null"

For the base backup we execute

select pg_start_backup('mylabel');

Once that succeeds, we tar up the cluster directory. When the tar has
completed, we run

select pg_stop_backup();

and go on our merry way.
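
As a rough sketch, the backup sequence above could be scripted like this.
The paths, the DRY_RUN switch and the function names are assumptions made
for illustration, not the actual script used; psql is assumed to find the
cluster through its usual environment settings.

```shell
#!/bin/sh
# Sketch of the base-backup sequence: start backup, tar the cluster, stop backup.
# PGDATA and BACKUP_TAR are placeholder paths (assumptions, not real values).
PGDATA="${PGDATA:-/path/to/cluster}"
BACKUP_TAR="${BACKUP_TAR:-/path/to/backup/base.tar}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, anything else = execute

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

base_backup() {
    # Put the cluster into backup mode; give up if this fails.
    run psql -c "select pg_start_backup('mylabel');" || return 1

    # Archive the cluster directory while backup mode is active.
    run tar -cf "$BACKUP_TAR" -C "$PGDATA" .
    tar_status=$?

    # Leave backup mode even if tar failed, then report tar's status.
    run psql -c "select pg_stop_backup();" || return 1
    return "$tar_status"
}

base_backup
```

With the default DRY_RUN=1 the script only prints the three commands in
order; setting DRY_RUN=0 would execute them.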

The other night we migrated to a new environment. We copied the tar file to
the new environment and extracted it there. We then shut down the existing
postgres on the old environment and copied the archived WAL files and the
files in pg_xlog to the new server. This took place 3 days after the
initial tar was taken, and by this time we had about 1200 WAL files. We
replaced the pg_xlog files in the new environment with the ones we had just
copied from the shut-down server. Our recovery.conf simply pointed to the
WAL archive directory, with no target time, and we started postgres. Sure
enough, all the WAL files were replayed and we got the message that the
database is ready to accept connections. We tested and everything looked
fine.
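
For illustration, a recovery.conf of the kind described (a restore command
pointing at the WAL archive, no recovery target) could be generated as in
the sketch below. The function name and the demo directory are
hypothetical, and the restore_command shown is the plain cp form, which is
an assumption about what "simply pointing to the archive" meant.

```shell
#!/bin/sh
# Write a minimal recovery.conf with only a restore_command.
# $1 = data directory of the restored cluster, $2 = WAL archive directory.
write_recovery_conf() {
    cat > "$1/recovery.conf" <<EOF
restore_command = 'cp $2/%f %p'
EOF
}

# Demo against a throwaway directory instead of a real cluster.
demo_dir=$(mktemp -d)
write_recovery_conf "$demo_dir" /path/to/nfsshare
cat "$demo_dir/recovery.conf"
```

With no recovery target set, replay continues through all WAL that the
restore command can supply.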

All hell broke loose the next day: missing chunk 0, unexpected chunk, bad
siblings, chunks in toast tables screwed up, etc. It has been a nightmare.
We could not even execute a pg_dumpall. We had to spend days looking for
the affected rows and updating them so that eventually the pg_dump worked.
I turned on the old server for validation and the corruption was not there.

What could have caused this? Our wal_sync_method is set to the default,
fdatasync. The original server was on a Red Hat cluster with a GFS file
system, and the server could never shut down gracefully when the sysadmins
shut it down; it was always an immediate shutdown. This is because of the
cluster config, which we have since moved off of.

Any help would be appreciated.

Thanks!!!

JT
