From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)postgreSQL(dot)org
Subject: WAL replay failure after file truncation(?)
Date: 2005-05-25 15:02:11
Message-ID: 12133.1117033331@sss.pgh.pa.us
Lists: pgsql-hackers
We've seen two recent reports:
http://archives.postgresql.org/pgsql-admin/2005-04/msg00008.php
http://archives.postgresql.org/pgsql-general/2005-05/msg01143.php
of postmaster restart failing because the WAL contains a reference
to a page that no longer exists.
I can think of a couple of possible explanations:
1. filesystem corruption, ie the page should exist in the file but the
kernel has forgotten about it;
2. we truncated the file subsequent to the WAL record that causes
the panic.
However, neither of these theories is entirely satisfying, because
the WAL replay logic has always acted like this; why haven't we
seen similar reports ever since 7.1? And why are both of these
reports connected to btrees, when file truncation probably happens
far more often on regular tables?
But, setting those nagging doubts aside, theory #2 seems like a definite
bug that we ought to do something about.
The only really clean answer I can see is for file truncation to force a
checkpoint just before issuing the ftruncate call. That way, no WAL
records referencing the to-be-deleted pages would need to be replayed in
a subsequent crash. However, checkpoints are expensive enough to make
this solution very unattractive from a performance point of view. And
I fear it's not a 100% solution anyway: what about the PITR scenario,
where you need to replay a WAL log that was made concurrently with a
filesystem backup being taken? The backup might well include the
truncated version of the file, but you can't avoid replaying the
beginning portion of the WAL log.
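
To make the ordering concrete, here is a minimal sketch of what Plan A
would require of the truncation path: write a checkpoint first, so that
no WAL record replayed after it can reference the removed pages, and
only then shrink the file. The names write_checkpoint(), log_truncate(),
and truncate_relation() are illustrative stand-ins, not the actual
checkpoint or smgr routines.

/*
 * Sketch of Plan A: checkpoint before shrinking the file, so that no
 * WAL record needing replay can reference the to-be-deleted pages.
 * write_checkpoint() and log_truncate() are hypothetical stand-ins.
 */
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

#define BLCKSZ 8192

static void write_checkpoint(void)
{
    /* stand-in: flush dirty buffers, fsync data files, emit checkpoint */
    puts("checkpoint: WAL before this point need not be replayed");
}

static void log_truncate(long new_blocks)
{
    /* stand-in: emit a WAL record describing the truncation itself */
    printf("WAL: truncate relation to %ld blocks\n", new_blocks);
}

int truncate_relation(const char *path, long new_blocks)
{
    int fd = open(path, O_RDWR);

    if (fd < 0)
        return -1;

    write_checkpoint();        /* the expensive step Plan A adds */
    log_truncate(new_blocks);

    if (ftruncate(fd, new_blocks * (off_t) BLCKSZ) < 0)
    {
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}

int main(void)
{
    /* usage example against a scratch file standing in for a relation */
    const char *path = "scratch_relation";
    int fd = open(path, O_RDWR | O_CREAT, 0600);

    if (fd < 0 || ftruncate(fd, 16 * (off_t) BLCKSZ) < 0)  /* 16 blocks */
        return 1;
    close(fd);

    return truncate_relation(path, 4) == 0 ? 0 : 1;
}

The cost objection is visible right in the sketch: every truncation pays
for a full checkpoint, and as noted above even that doesn't help the
PITR case.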
Plan B is for WAL replay to always be willing to extend the file to
cover whatever block number is mentioned in the log, even though this
may require inventing the contents of empty pages; we trust that their
contents won't matter because they'll be truncated again later in the
replay sequence. This seems pretty messy though, especially for
indexes. The major objection to it is that it gives up error detection
in real filesystem-corruption cases: we'll just silently build an
invalid index and then try to run with it. (Still, that might be better
than refusing to start; at least you can REINDEX afterwards.)
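
For illustration only, a rough sketch of the mechanical core of Plan B
follows; this is not the actual xlog redo code, and extend_to_block()
is a hypothetical helper built on plain POSIX calls. If a replayed
record references block N of a file that is currently shorter than
N+1 blocks, we invent zero-filled pages up to N instead of PANICking,
trusting a later truncate record to remove them again.

/*
 * Sketch of Plan B: during replay, extend the file with zero pages so
 * that the referenced block exists. extend_to_block() is hypothetical.
 */
#define _XOPEN_SOURCE 700
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/stat.h>

#define BLCKSZ 8192

/* Ensure block 'blkno' exists in 'fd'; invent zero pages if needed. */
static int extend_to_block(int fd, long blkno)
{
    struct stat st;
    char zeropage[BLCKSZ];
    long nblocks;

    if (fstat(fd, &st) < 0)
        return -1;
    nblocks = st.st_size / BLCKSZ;

    memset(zeropage, 0, sizeof(zeropage));
    while (nblocks <= blkno)
    {
        /* contents don't matter: a later truncate record should drop them */
        if (pwrite(fd, zeropage, BLCKSZ, (off_t) nblocks * BLCKSZ) != BLCKSZ)
            return -1;
        nblocks++;
    }
    return 0;
}

int main(void)
{
    int fd = open("scratch_relation", O_RDWR | O_CREAT, 0600);

    if (fd < 0)
        return 1;

    /* pretend a WAL record references block 10 of a too-short file */
    if (extend_to_block(fd, 10) < 0)
    {
        close(fd);
        return 1;
    }
    close(fd);
    return 0;
}

Note that the helper silently papers over a genuinely missing page,
which is exactly the loss of error detection complained about above.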
Any thoughts?
regards, tom lane