Re: Version 7.2.3 unrecoverable crash on missing pg_clog

From: Andy Osborne <andy(at)sift(dot)co(dot)uk>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: Version 7.2.3 unrecoverable crash on missing pg_clog
Date: 2003-01-09 15:12:24
Message-ID: 3E1D9158.8000300@sift.co.uk
Lists: pgsql-bugs

Tom Lane wrote:
> Andy Osborne <andy(at)sift(dot)co(dot)uk> writes:
>
>>One of our databases crashed yesterday with a bug that looks
>>a lot like the non superuser vacuum issue that 7.2.3 was
>>intended to fix, although we do our vacuum with a user that
>>has usesuper=t in pg_user so I guess it's not that simple.
>
>
>>FATAL 2: open of /u0/pgdata/pg_clog/0726 failed: No such file or directory
>
>
> What range of file names do you actually see in pg_clog?

Currently 0000 to 00D6. I don't know what it was last night.

> The fixes in 7.2.3 were for cases that would try to access
> already-removed clog segments (file numbers less than what's present).
> In this case the accessed file name is large enough that I'm thinking
> the problem is due to a garbage transaction number being passed to the
> transaction-status-check code. So my bet is on physical data corruption
> in the table that was causing the problem. It turns out that the first
> detectable symptom of a trashed tuple header is often a failure like
> this :-(.
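
To put rough numbers on the file names above, here is a minimal sketch
that maps a pg_clog segment name to the range of transaction IDs it
would cover. It assumes the default 8 kB block size, 2 status bits per
transaction and 32 pages per segment file; those constants and the
helper names are illustrative assumptions, not taken from this thread.

```python
# Assumed layout: 8 kB pages, 4 transactions per byte (2 bits each),
# 32 pages per segment file, segment names written as 4 hex digits.
BLCKSZ = 8192
XACTS_PER_BYTE = 4
PAGES_PER_SEGMENT = 32
XACTS_PER_SEGMENT = BLCKSZ * XACTS_PER_BYTE * PAGES_PER_SEGMENT


def xid_range_for_segment(name):
    """Return (first_xid, last_xid) covered by a segment file name."""
    seg = int(name, 16)
    first = seg * XACTS_PER_SEGMENT
    return first, first + XACTS_PER_SEGMENT - 1


def segment_for_xid(xid):
    """Return the segment file name that would hold the status of xid."""
    return "%04X" % (xid // XACTS_PER_SEGMENT)


if __name__ == "__main__":
    # Highest segment present on disk vs. the one the backend tried to open.
    for name in ("00D6", "0726"):
        first, last = xid_range_for_segment(name)
        print(name, first, last)
```

Under those assumptions, 00D6 ends at a transaction ID of roughly 225
million, while 0726 starts at roughly 1.9 billion, far beyond anything
the installation has reached. That fits the reading that the
transaction number came from a damaged tuple header rather than from
an already-removed segment.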

/u0 is a Linux software RAID using RAID 1 on three disks
with two raid-disks and one spare-disk. The previous backup
(a pg_dump in plain text SQL) ran ok and produced a clean backup
that we were able to reload. We do this every three hours and
the next backup was running when the database crashed. Any
attempt to access the table crashed it again. I don't know if
it helps, but a select * from news where (condition on a field
with an index) was ok, but if the where clause was not indexed and
resulted in a table scan, it crashed the database.

While I wouldn't rule out data corruption, the kernel message
ring has no errors for the md driver, scsi host adapter or the
disks, which I would expect if we had bad blocks appearing on a
disk or somesuch. The machine has been running with v7.2.3 for
about two months and hasn't shown this problem before. My gut
feeling is that it's something else.

> You didn't happen to make a physical copy of the news table before
> dropping it, did you? It'd be interesting to examine the remains.
> So far, the cases I have seen like this all seem to be due to hardware
> faults, but we've seen it just often enough to make me wonder if there
> is a software issue too.

Sadly, no I didn't. This is one of our live database servers
and I was under a lot of pressure to get it back quickly. If
it does it again, what can I do to provide the most useful
feedback?

Thanks,

Andy

--
Andy Osborne **************** "Vertical B2B Communities"
Senior Internet Engineer
Sift Group 100 Victoria Street, Bristol BS1 6HZ
tel:+44 117 915 9600 fax:+44 117 915 9630 http://www.sift.co.uk
