Re: corrupt pages detected by enabling checksums

From: Florian Pflug <fgp(at)phlo(dot)org>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Davis <pgsql(at)j-davis(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: corrupt pages detected by enabling checksums
Date: 2013-04-05 08:34:42
Message-ID: B8B12124-E852-4203-B802-819F720283EE@phlo.org
Lists: pgsql-hackers

On Apr 4, 2013, at 23:21, Jeff Janes <jeff(dot)janes(at)gmail(dot)com> wrote:
> This brings up a pretty frightening possibility to me, unrelated to data checksums. If a bit gets twiddled in the WAL file due to a hardware issue or a "cosmic ray", and then a crash happens, automatic recovery will stop early with the failed WAL checksum with an innocuous-looking message. The system will start up but will be invisibly inconsistent, and will proceed to overwrite that portion of the WAL file which contains the old data (real data that would have been necessary to reconstruct, once the corruption is finally realized) with an end-of-recovery checkpoint record and continue to chew up real data from there.

Maybe we could scan forward to check whether a corrupted WAL record is followed by one or more valid ones with sensible LSNs. If it is, chances are high that we haven't actually hit the end of the WAL. In that case, we could either log a warning, or (better, probably) abort crash recovery. The user would then need to either restore the broken WAL segment from backup, or override the check by e.g. setting recovery_target_record="invalid_record". (The default would be recovery_target_record="last_record". The name of the GUC tries to be consistent with existing recovery.conf settings, even though it affects crash recovery, not archive recovery.)

Corruption of the fields we need in order to scan past the record (such as its length) would cause false negatives, i.e. the check would not trigger an error even though recovery does stop mid-way through the WAL. There's a risk of false positives too, but they require quite specific orderings of writes and thus seem rather unlikely. (AFAICS, the OS would have to write some parts of record N followed by the whole of record N+1 and then crash to cause a false positive.)
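
To make the idea concrete, here is a minimal Python sketch of the forward scan. It is not PostgreSQL code: the record layout (total length, LSN, CRC32, payload), the rule that a "sensible" LSN equals the record's byte offset, and the names encode_record, read_record and classify_wal_end are all assumptions made up for illustration.

```python
import struct
import zlib

# Hypothetical toy record header: total length (uint32), LSN (uint64),
# CRC32 over (LSN + payload) (uint32). This is NOT the real WAL format.
HEADER = struct.Struct("<IQI")


def encode_record(lsn, payload):
    """Serialize one toy record; in this format the LSN is its byte offset."""
    crc = zlib.crc32(struct.pack("<Q", lsn) + payload)
    return HEADER.pack(HEADER.size + len(payload), lsn, crc) + payload


def read_record(buf, offset):
    """Return (lsn, total_length) if a valid record starts at offset, else None."""
    if offset + HEADER.size > len(buf):
        return None
    length, lsn, crc = HEADER.unpack_from(buf, offset)
    if length < HEADER.size or offset + length > len(buf):
        return None
    payload = buf[offset + HEADER.size:offset + length]
    if zlib.crc32(struct.pack("<Q", lsn) + payload) != crc:
        return None
    if lsn != offset:  # assumed "sensible LSN" rule for this toy format
        return None
    return lsn, length


def classify_wal_end(buf):
    """Replay until the first invalid record, then peek past it.

    Returns ("end_of_wal", offset) or ("corruption_suspected", offset),
    where offset is the position at which replay stopped.
    """
    offset = 0
    while True:
        rec = read_record(buf, offset)
        if rec is None:
            break
        offset += rec[1]

    # Try to skip the broken record using its own (possibly damaged) length.
    if offset + HEADER.size <= len(buf):
        length = HEADER.unpack_from(buf, offset)[0]
        if length >= HEADER.size and read_record(buf, offset + length) is not None:
            # A valid record with a sensible LSN follows the broken one,
            # so this is probably not the real end of the WAL.
            return ("corruption_suspected", offset)
    return ("end_of_wal", offset)
```

With three records in a buffer, flipping a bit in the payload of the second one yields "corruption_suspected", because the third record is still valid and sits where the length field says it should. Zeroing the second record's length field instead yields "end_of_wal": the scan cannot find the next record, which is the false-negative case described above.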

best regards,
Florian Pflug
