Re: corrupt pages detected by enabling checksums

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: Jeff Davis <pgsql(at)j-davis(dot)com>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: corrupt pages detected by enabling checksums
Date: 2013-04-05 01:06:15
Message-ID: 18113.1365123975@sss.pgh.pa.us
Lists: pgsql-hackers

Jeff Davis <pgsql(at)j-davis(dot)com> writes:
> On Thu, 2013-04-04 at 14:21 -0700, Jeff Janes wrote:
>> This brings up a pretty frightening possibility to me, unrelated to
>> data checksums. If a bit gets twiddled in the WAL file due to a
>> hardware issue or a "cosmic ray", and then a crash happens, automatic
>> recovery will stop early at the failed WAL checksum with
>> an innocuous-looking message. The system will start up but will be
>> invisibly inconsistent, and will proceed to overwrite that portion of
>> the WAL file which contains the old data (real data that would have
>> been necessary to reconstruct, once the corruption is finally realized)
>> with an end-of-recovery checkpoint record and continue to chew up
>> real data from there.

> I've been worried about that for a while, and I may have even seen
> something like this happen before. We could perhaps do some checks, but
> in general it seems hard to solve without flushing some data to
> two different places. For example, you could flush WAL, and then update
> an LSN stored somewhere else indicating how far the WAL has been
> written. Recovery could complain if it gets an error in the WAL before
> that point.

> But obviously, that makes WAL flushes expensive (in many cases, about
> twice as expensive).

> Maybe it's not out of the question to offer that as an option if nobody
> has a better idea. Performance-conscious users could place the extra LSN
> on an SSD or NVRAM or something; or maybe use commit_delay or async
> commits. It would only need to store a few bytes.
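
A minimal sketch of the check described above, assuming a toy WAL made of
checksummed records and a separately stored "flushed through" LSN. The names
Record, make_record, replay and flushed_lsn are hypothetical and are not
PostgreSQL APIs; zlib.crc32 merely stands in for the real WAL checksum.

```python
import zlib
from dataclasses import dataclass


class WalCorruptionError(Exception):
    """A record failed its checksum at or before the separately stored LSN."""


@dataclass
class Record:
    lsn: int        # end position of this record in the toy WAL
    payload: bytes
    crc: int        # CRC-32 of payload, computed when the record was written


def make_record(lsn, payload):
    """Build a record with a checksum over its payload."""
    return Record(lsn, payload, zlib.crc32(payload))


def replay(records, flushed_lsn):
    """Replay records in order and return the last LSN applied.

    flushed_lsn is the hypothetical watermark kept in a second location.
    A bad checksum beyond it is treated as the ordinary end of WAL; a bad
    checksum at or before it means data known to be flushed is damaged.
    """
    reached = 0
    for rec in records:
        if zlib.crc32(rec.payload) != rec.crc:
            if rec.lsn <= flushed_lsn:
                raise WalCorruptionError(
                    f"bad record at LSN {rec.lsn}, "
                    f"but WAL was flushed through {flushed_lsn}"
                )
            break  # past the watermark: stop quietly, as recovery does today
        reached = rec.lsn
    return reached
```

With three records and the second one damaged, replay(wal, flushed_lsn=1)
stops quietly after the first record, while replay(wal, flushed_lsn=3) raises
instead of starting up in an inconsistent state.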

At least on traditional rotating media, this is only going to perform
well if you dedicate two drives to the purpose. At which point you
might as well just say "let's write two copies of WAL". Or maybe three
copies, so that you can take a vote when they disagree. While this is
not so unreasonable on its face for ultra-high-reliability requirements,
I can't escape the feeling that we'd just be reinventing software RAID.
There's no reason to think that we can deal with this class of problems
better than the storage system can.
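
To illustrate the "take a vote" idea only: a sketch that compares the same
WAL block as read from several copies and accepts it when a strict majority
agree. The function name vote and the use of raw bytes for a block are
assumptions made for the example, not anything PostgreSQL implements.

```python
from collections import Counter


def vote(copies):
    """Return the block content held by a strict majority of copies.

    copies is a non-empty list of bytes objects, one per redundant copy of
    the same WAL block. Returns None when no content has a strict majority.
    """
    content, count = Counter(copies).most_common(1)[0]
    return content if count * 2 > len(copies) else None
```

With three copies, one damaged copy is outvoted by the other two; with only
two copies a disagreement cannot be resolved, which is why a third copy is
needed before voting is possible.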

> Streaming replication mitigates the problem somewhat, by being a second
> place to write WAL.

Yeah, if you're going to do this at all it makes more sense for the
redundant copies to be on other machines. That raises two questions:
how smart is our SR code about dealing with a master that tries to
re-send WAL regions it already sent, and what should a slave do in
that situation if the new data doesn't match?

regards, tom lane
