From: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com> |
---|---|
To: | Greg Stark <stark(at)mit(dot)edu> |
Cc: | Amit Kapila <amit(dot)kapila(at)huawei(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Davis <pgsql(at)j-davis(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Andres Freund <andres(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: corrupt pages detected by enabling checksums |
Date: | 2013-05-10 18:06:49 |
Message-ID: | CAMkU=1zX8vL8_HmJPa61XBp5uTQwEBaKoz93O1zM98x4g4rKTw@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, May 10, 2013 at 9:54 AM, Greg Stark <stark(at)mit(dot)edu> wrote:
> On Fri, May 10, 2013 at 5:31 PM, Amit Kapila <amit(dot)kapila(at)huawei(dot)com> wrote:
> > In the case where one block is missing, how can it even reach to next record
> > to check "prev" pointer.
> > I think it can be possible when one of the record is corrupt and following
> > are okay which I think is the
> > case in which it can proceed with warning as suggested by Simon.
>
> A single WAL record can be over 24kB. The checksum covers the entire
> WAL record and if it reports corruption it can be because a chunk in
> the middle wasn't flushed to disk before the system crashed. The
> beginning of the WAL record with the length and checksum and the
> entire following record with its prev pointer might have been flushed
> but the missing block in the middle of this record means it can't be
> replayed. This would be a normal situation in case of a system crash.
>
> If you replayed the following record but not this record you would
> have an inconsistent database.

I don't think we would ever want to *skip* the record and play the next
one. But if it looks like the next record is valid, we might not want to
automatically open the database in a possibly inconsistent state and in the
process overwrite the only existing copy of those WAL records which would
be necessary to make it consistent. Instead, could we present the DBA with
an explicit choice: open the database; try to reconstruct the corrupted
record via forensic inspection so that it can be played through (I have no
idea how likely it is that such an attempt would succeed); or copy the
database for later inspection and then open it?

But based on your description, perhaps refusing to automatically restart
and forcing an explicit decision would happen a lot more often, during
normal crashes with no corruption, than I was thinking it would.
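
To make the check being discussed concrete, here is a minimal sketch in Python. It is a toy model only, not PostgreSQL's recovery code: the Record layout, the field names (lsn, prev, crc), the Outcome values and the classify() function are all hypothetical, and zlib.crc32 merely stands in for the real WAL checksum. It stops at the first record whose checksum fails and asks for an explicit decision only if the next record is itself valid and its prev pointer points back at the failed record.

```python
import zlib
from dataclasses import dataclass
from enum import Enum
from typing import List


class Outcome(Enum):
    CLEAN_END = "clean_end"            # nothing plausible follows; open normally
    NEEDS_DECISION = "needs_decision"  # bad record followed by a plausible one


@dataclass
class Record:
    # Hypothetical toy layout, not the real WAL record header.
    lsn: int        # start position of this record
    prev: int       # start position of the preceding record
    payload: bytes  # record body as found on disk
    crc: int        # checksum stored when the record was written

    def is_valid(self) -> bool:
        # zlib.crc32 is a stand-in for the real WAL checksum.
        return zlib.crc32(self.payload) == self.crc


def make_record(lsn: int, prev: int, payload: bytes) -> Record:
    """Build a record whose stored checksum matches its payload."""
    return Record(lsn, prev, payload, zlib.crc32(payload))


def classify(records: List[Record]) -> Outcome:
    """Scan records in order and classify how the WAL ends."""
    for i, rec in enumerate(records):
        if rec.is_valid():
            continue  # this record would be replayed
        # First record that fails its checksum: never skip it, only peek ahead.
        nxt = records[i + 1] if i + 1 < len(records) else None
        if nxt is not None and nxt.is_valid() and nxt.prev == rec.lsn:
            # The next record chains back to the bad one, so it was written
            # after it; replay stops here and a human gets to choose.
            return Outcome.NEEDS_DECISION
        # Whatever follows does not chain back (for example leftover data
        # from a recycled file), so this looks like an ordinary end of WAL.
        return Outcome.CLEAN_END
    return Outcome.CLEAN_END
```

Under this toy model, a record with a missing chunk in the middle followed by a fully flushed record returns NEEDS_DECISION, which per Greg's description could also happen after an ordinary crash. A bad record followed by data whose prev pointer does not match returns CLEAN_END.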

Of course the paranoid DBA could turn off restart_after_crash and do a
manual investigation on every crash, but in that case the database would
refuse to restart even in the case where it is perfectly clear that all the
following WAL belongs to the recycled file and not the current file. They
would also have to turn off any startup scripts in init.d, to make sure a
rebooting server doesn't do recovery automatically and destroy evidence
that way.

Cheers,
Jeff