Re: locate DB corruption

From: Dave Peticolas <dave(at)krondo(dot)com>
To: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: locate DB corruption
Date: 2018-09-01 23:45:50
Message-ID: CAPRbp046knJdd0fKV6GmHzPoVx13Dh9BG-9aL1R5F5s3DoJjPg@mail.gmail.com
Lists: pgsql-general

On Fri, Aug 31, 2018 at 8:48 PM Dave Peticolas <dave(at)krondo(dot)com> wrote:

> On Fri, Aug 31, 2018 at 5:19 PM Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
> wrote:
>
>> On 08/31/2018 08:51 AM, Dave Peticolas wrote:
>> > On Fri, Aug 31, 2018 at 8:14 AM Adrian Klaver
>> > <adrian(dot)klaver(at)aklaver(dot)com <mailto:adrian(dot)klaver(at)aklaver(dot)com>> wrote:
>> >
>> > On 08/31/2018 08:02 AM, Dave Peticolas wrote:
>> > > Hello, I'm running into the following error running a large query
>> > > on a database restored from WAL replay:
>> > >
>> > > could not access status of transaction 330569126
>> > > DETAIL: Could not open file "pg_clog/0C68": No such file or directory
>> >
>> >
>> > Postgres version?
>> >
>> >
>> > Right! Sorry, that original email didn't have a lot of info. This is
>> > 9.6.9 restoring a backup from 9.6.8.
>> >
>> > Where is the replay coming from?
>> >
>> >
>> > From a snapshot and WAL files stored in Amazon S3.
>>
>> Seems the process is not creating a consistent backup.
>>
>
> This time, yes. This setup has been working for almost two years with
> probably hundreds of restores in that time. But nothing's perfect I guess :)
>
>
>> How are they being generated?
>>
>
> The snapshots are sent to S3 via a tar process after calling the start
> backup function. I am following the postgres docs here. The WAL files are
> just copied to S3.
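
(For anyone who hits this thread later, the tar-plus-WAL setup described
above boils down to something like the commands below. This is only a
sketch: the bucket name, data directory path, and use of the aws CLI are
placeholders, not the actual scripts in use here.)

    # Illustrative only -- bucket, paths, and the aws CLI are placeholders.
    psql -U postgres -c "SELECT pg_start_backup('s3-base-backup', true);"

    # Tar the data directory (minus pg_xlog contents) and stream it to S3.
    tar --exclude='pg_xlog/*' -czf - /var/lib/postgresql/9.6/main \
        | aws s3 cp - s3://example-bucket/base/base-$(date +%Y%m%d).tar.gz

    psql -U postgres -c "SELECT pg_stop_backup();"

    # WAL segments are shipped to S3 separately, e.g. via archive_command:
    #   archive_command = 'aws s3 cp %p s3://example-bucket/wal/%f'
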
>
>
>>
>> > Are you sure you are not working across versions?
>> >
>> >
>> > I am sure, they are all 9.6.
>> >
>> > If not, do pg_clog/ and 0C68 actually exist?
>> >
>> >
>> > pg_clog definitely exists, but 0C68 does not. I think I have
>> > subsequently found the precise row in the specific table that seems to
>> > be the problem. Specifically I can select * from TABLE where id = BADID
>> > - 1 or id = BADID + 1 and the query returns. I get the error if I
>> > select the row with the bad ID.
>> >
>> > Now what I'm not sure of is how to fix it.
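
(For the archives: the narrowing-down above was essentially probing ids on
either side of the suspect row and checking which clog segment the failing
transaction ought to live in. The database, table, and id values below are
placeholders, not the real ones.)

    # Placeholders throughout -- database, table, and id values are made up.
    # Rows on either side of the suspect id read back fine:
    psql -d mydb -c "SELECT * FROM some_table WHERE id = 123456788;"
    psql -d mydb -c "SELECT * FROM some_table WHERE id = 123456790;"

    # The suspect row itself reproduces the error:
    psql -d mydb -c "SELECT * FROM some_table WHERE id = 123456789;"
    # ERROR:  could not access status of transaction ...
    # DETAIL: Could not open file "pg_clog/XXXX": No such file or directory

    # With the default 8 kB block size, each pg_clog segment covers
    # 1,048,576 transactions and is named with the segment number in hex,
    # so you can check whether the expected segment file is present:
    XID=123456789
    printf '%04X\n' $(( XID / 1048576 ))   # expected segment file name
    ls "$PGDATA/pg_clog/"
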
>>
>> One thing I can think of is to rebuild from a later version of your S3
>> data and see if it has all the necessary files.
>>
>
> Yes, I think that's a good idea, I'm trying that.
>
>
>> There is also pg_resetxlog:
>>
>> https://www.postgresql.org/docs/9.6/static/app-pgresetxlog.html
>>
>> I have not used it, so I cannot offer much in the way of tips. Just
>> from reading the docs I would suggest stopping the server and then
>> creating a backup of $PG_DATA (if possible) before using pg_resetxlog.
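
(Noting the conservative sequence here in case someone needs it later. This
is just a sketch: the data directory path is a placeholder, and -n is a dry
run that only prints the values pg_resetxlog would use.)

    # Sketch only -- adjust the data directory path for your install.
    pg_ctl -D /var/lib/postgresql/9.6/main stop -m fast

    # Keep a full copy of the data directory before touching anything.
    cp -a /var/lib/postgresql/9.6/main /var/lib/postgresql/9.6/main.bak

    # -n is a dry run: it prints the control values pg_resetxlog would use
    # without modifying anything.
    pg_resetxlog -n /var/lib/postgresql/9.6/main
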
>>
>
> Thanks, I didn't know about that. The primary DB seems OK so hopefully it
> won't be needed.
>

Well, restoring from a backup of the primary does seem to have fixed the
issue with the corrupt table.
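
For what it's worth, a quick way to double-check a restored copy for any
other rows with missing clog segments is to force every row of the suspect
table to be read, along these lines (database and table names are
placeholders):

    # Placeholders for the database/table names. pg_dump has to read every
    # row, so it fails if any tuple's transaction status can't be looked up.
    pg_dump -d mydb -t some_table -f /dev/null

    # A VACUUM pass visits every heap page and will surface the same error.
    psql -d mydb -c "VACUUM VERBOSE some_table;"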
