From: | Simon Riggs <simon(at)2ndQuadrant(dot)com> |
---|---|
To: | Greg Stark <stark(at)mit(dot)edu> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jim Nasby <jim(at)nasby(dot)net>, Jeff Davis <pgsql(at)j-davis(dot)com>, Florian Pflug <fgp(at)phlo(dot)org>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Andres Freund <andres(at)2ndquadrant(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: corrupt pages detected by enabling checksums |
Date: | 2013-05-10 17:32:45 |
Message-ID: | CA+U5nMLJTeJ-EzWgujLgFuAdb=AMU16dvfgeQKENenRTff_CwA@mail.gmail.com |
Lists: | pgsql-hackers |
On 10 May 2013 13:39, Greg Stark <stark(at)mit(dot)edu> wrote:
> On Fri, May 10, 2013 at 7:44 AM, Simon Riggs <simon(at)2ndquadrant(dot)com> wrote:
>>> Having one corrupt record followed by a valid record is not an
>>> abnormal situation. It could easily be the correct end of WAL.
>>
>> I disagree, that *is* an abnormal situation and would not be the
>> "correct end-of-WAL".
>>
>> Each WAL record contains a "prev" pointer to the last WAL record. So
>> for the next record to be valid the prev pointer would need to be
>> exactly correct.
>
> Well then you're wrong. The OS makes no guarantee that blocks are
> written out in order. When the system crashes any random subset of the
> blocks written but not synced might have been written to disk and
> others not. There could be megabytes of correct WAL written with just
> one block in the middle of the first record not written. If no xlog
> sync had occurred (or one was in progress but not completed) then
> that's the correct end of WAL.

I agree that the correct end of WAL is where the last sync occurred.
We never write() WAL without an immediate sync(), so the chances of
what you describe happening are very low to impossible. The timing
window between the write and the sync is negligible, yet I/O would
need to occur inside that window and also complete out of order
relative to the writes. That is unlikely, because an I/O elevator
would either leave the order of writes alone or keep them sequential
to avoid head movement, which is what we want. I guess we should add
here "...with disks, maybe not with SSDs".

In any case, what matters more is that your idea of occasionally
writing the minRecoveryPoint and then cross-checking it against the
current LSN would allow us to at least confirm that we have a single
corrupt record and report that situation accurately to the user. That
idea will cover 95+% of such problems anyway, since what we care about
is long sequences of WAL records, not just the last few written at
crash, which is what the above discussion focuses on.
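
To make the cross-check concrete, here is a minimal Python sketch, not PostgreSQL code. It assumes a toy record layout with a "prev" pointer and a checksum, and a hypothetical known_flushed_lsn standing in for the occasionally written recovery point. Record, make_record, scan and Verdict are illustrative names, and zlib.crc32 is only a stand-in for the real WAL checksum.

```python
import zlib
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional, Tuple


class Verdict(Enum):
    CLEAN_END = "all records valid"
    TORN_TAIL = "invalid record past the known flush point; treat as end of WAL"
    CORRUPT = "invalid record before the known flush point; report corruption"


@dataclass
class Record:
    lsn: int        # start position of this record (toy model)
    prev: int       # start position of the previous record
    payload: bytes
    crc: int        # checksum stored alongside the payload

    def is_valid(self, expected_prev: int) -> bool:
        # A record is accepted only if it chains to the record before it
        # and its stored checksum matches its payload.
        return self.prev == expected_prev and zlib.crc32(self.payload) == self.crc


def make_record(lsn: int, prev: int, payload: bytes) -> Record:
    """Build a record with a correct checksum (hypothetical helper)."""
    return Record(lsn, prev, payload, zlib.crc32(payload))


def scan(records: List[Record], start_prev: int,
         known_flushed_lsn: int) -> Tuple[Verdict, Optional[int]]:
    """Walk the records in order and classify the first invalid one.

    Simplification: only the record's start position is compared with
    known_flushed_lsn; a real check would also consider where it ends.
    """
    expected_prev = start_prev
    for rec in records:
        if not rec.is_valid(expected_prev):
            if rec.lsn < known_flushed_lsn:
                # WAL up to known_flushed_lsn was synced, so this cannot
                # be an ordinary torn write at the end of WAL.
                return Verdict.CORRUPT, rec.lsn
            return Verdict.TORN_TAIL, rec.lsn
        expected_prev = rec.lsn
    return Verdict.CLEAN_END, None
```

With a flush point recorded beyond the bad record, the scan reports corruption instead of quietly treating it as end of WAL. Without that information the same bad record is indistinguishable from a torn tail.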
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services