Re: Trying to recover a corrupted database

From: John Scalia <jayknowsunix(at)gmail(dot)com>
To: Scott Whitney <swhitney(at)journyx(dot)com>, "pgsql-admin(at)postgresql(dot)org" <pgsql-admin(at)postgresql(dot)org>
Subject: Re: Trying to recover a corrupted database
Date: 2014-07-17 20:05:52
Message-ID: 53C82CA0.6050607@gmail.com
Lists: pgsql-admin

I was attempting to do that with one table that I know is broken, but I can't get anything from it with a SELECT. No matter what I try, all I get is:

ERROR: could not access status of transaction 3706060803
DETAIL: Could not open file "pg_clog/0CDE": No such file or directory

I think the pg_clog files are the transaction commit-status logs that tell the DB whether a record's transaction committed or aborted, but I'm not certain. I just wish there
were a pg_resetclog executable, the way pg_resetxlog exists for the pg_xlog segments.
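For what it's worth, a workaround people sometimes use for a missing pg_clog segment is to fabricate a zero-filled one so scans can proceed: zeroed clog bytes read as "transaction in progress", so rows from those transactions are treated as uncommitted and skipped. Do this only on a throwaway copy of the data directory, never the live cluster. A sketch (the segment name is the one from the error message; the copy path here is a stand-in):

```shell
# DANGER: operate on a disposable copy of the data directory only.
# One clog segment is 32 pages x 8 kB = 262144 bytes; zeroed bytes mean
# "transaction in progress", so rows from those transactions are skipped.
COPY=$(mktemp -d)                 # stand-in for a copy of /opt/datacenter
mkdir -p "$COPY/pg_clog"
dd if=/dev/zero of="$COPY/pg_clog/0CDE" bs=262144 count=1
```

Rows hidden this way are lost to the dump, so treat anything recovered like this as a salvage pass, not a clean backup.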

On 7/17/2014 3:58 PM, Scott Whitney wrote:
> Yeah. I know the feeling, man. What I ended up doing was running "select * from" until I found the broken one, then jumped down into groups of thousands of records and
> basically "Newtoned" it down, binary-search style, until I found the affected data.
>
> It sucked. A lot.
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Well, we already have the schema in a canned file, so we wouldn't need a pg_dump for the schema, and I can obviously get into the database using psql; that's how I
> produced the setting from pg_settings. But I really don't want to have to go through 150K+ records to find the multiple broken ones if I don't have to. I'm just hoping
> for a faster way.
>
> On 7/17/2014 3:30 PM, Scott Whitney wrote:
>
> Well, at this point, let me let the big brains answer your main question on the thread and tell you what _I_ did when I got into a similar situation.
>
> a) pg_dump --schema-only <database name> > db.structure
> b) From here it was a lot of select * from table until I found the broken one(s) and I worked around the actual bad data by offset/limit on the selects on the busted tables.
>
> It took me several hours, but it did work (for me).
>
> That's assuming you can even psql to the db in the first place when the postmaster is running.
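The offset/limit hunt described above can be scripted as a binary search rather than done by hand. A sketch, assuming psql access; the database and table names are placeholders, and OFFSET without ORDER BY gives no guaranteed row order, so treat reported offsets as approximate:

```shell
# Bisect a table for unreadable rows: if SELECTing a slice of rows fails,
# split the slice in half and recurse until single bad rows are isolated.
DB=mydb                      # placeholder database name
TABLE=big_table              # placeholder table name

# probe OFFSET LIMIT -> exit 0 if the slice is readable
probe() {
  psql -q -d "$DB" -c "SELECT * FROM $TABLE OFFSET $1 LIMIT $2" >/dev/null 2>&1
}

# bisect OFFSET COUNT -> print the offset of each unreadable row
bisect() {
  if probe "$1" "$2"; then return; fi            # whole slice reads fine
  if [ "$2" -le 1 ]; then echo "bad row at offset $1"; return; fi
  bisect "$1" "$(( $2 / 2 ))"
  bisect "$(( $1 + $2 / 2 ))" "$(( $2 - $2 / 2 ))"
}
```

Then kick it off with something like "bisect 0 154000" for a table of that size; if you need stable offsets across runs, add an ORDER BY on the primary key inside the probe query.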
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> We're only working with a single database on this system, and yes, I attempted a pg_dump earlier today on just one table - it failed (I know pg_dump and
> pg_basebackup use different mechanisms). Mind you, it's a large table with 154,000 rows in it. The overall database is somewhere around 43GB.
>
> On 7/17/2014 3:16 PM, Scott Whitney wrote:
>
> a) How many databases do you have?
> b) Have you tried to pg_dump the database(s) to see if that succeeds?
>
> If you _can_ pg_dump (which you might or might not be able to do), you could re-init the cluster and restore.
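For the record, the dump / re-init / restore path looks roughly like this. A sketch only: the database name is a placeholder, the port and data directory come from the ps output in the thread, and the old data directory should be moved aside, not deleted:

```shell
# 1. Dump while the damaged cluster is still up (may fail on bad pages):
pg_dump -p 5431 mydb > mydb.sql                    # schema + data for one DB
pg_dumpall -p 5431 --globals-only > globals.sql    # roles and tablespaces
# 2. Stop the server, move the old data directory aside, then:
initdb -D /opt/datacenter.new
# 3. Start the new cluster on the new directory, then restore:
psql -p 5431 -d postgres -f globals.sql
createdb -p 5431 mydb
psql -p 5431 -d mydb -f mydb.sql
```

These commands need a running (or re-initialized) cluster, so this is the shape of the procedure rather than something to paste blindly.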
>
> ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> Hi all,
>
> You may have seen my post from yesterday about our production database getting corrupted. Well, this morning we brought the system down to single-user mode and ran
> an fsck, which did report some drive errors. We repeated it until no additional errors were reported. Then we brought the system back to multi-user status and ran a
> successful pg_basebackup on the broken database. Since then we have restarted the database, and a ps -ef result looks like:
>
> /usr/pgsql-9.2/bin/postmaster -D /opt/datacenter -o -c zero_damaged_pages=true -i -N 384 -p 5431
>
> After the Db started up, we ran a VACUUM FULL ANALYZE, which ran for about 3 hours, but the database is still showing the same type of errors in its log:
> "invalid page header in block 29718"... etc. What disturbs me a little is that I don't think the zero_damaged_pages setting got applied. Checking the pg_settings
> view, we got:
>
> select name, setting, boot_val, reset_val from pg_settings where name = 'zero_damaged_pages';
>         name        | setting | boot_val | reset_val
> --------------------+---------+----------+-----------
>  zero_damaged_pages | on      | off      | on
>
> Now, my colleague ran this after he tried running some operations again, after I told him how to set zero_damaged_pages again. He swears that it was on when the first
> VACUUM FULL ANALYZE was run, but I'm not as sure. Plus, I don't understand why the boot_val shows as off. In any event, as we're still getting log errors like before,
> I don't really know what to try next other than rerunning the VACUUM FULL. Help?
> --
> Jay
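One way to remove any doubt about which session had zero_damaged_pages set is to set it, echo it back, and run the VACUUM all in one psql session. (For what it's worth, boot_val in pg_settings is the compiled-in default, which is off for this setting, so that part of the output is expected; the "setting" column is what the current session actually uses.) A sketch, with the database name assumed and the port taken from the ps output above:

```shell
# Set, verify, and use zero_damaged_pages in a single psql session, so
# there is no question about what the VACUUM actually ran with.
psql -p 5431 -d mydb <<'SQL'
SET zero_damaged_pages = on;
SHOW zero_damaged_pages;   -- should report "on" before the vacuum starts
VACUUM FULL ANALYZE;
SQL
```

This needs the server running, so it is the shape of the check rather than something testable offline.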
>
>
>
> --
> Sent via pgsql-admin mailing list (pgsql-admin(at)postgresql(dot)org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-admin
