From: andres(at)anarazel(dot)de (Andres Freund)
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: corrupt pages detected by enabling checksums
Date: 2013-04-04 00:28:32
Message-ID: 20130404002832.GB19178@awork2.anarazel.de
Lists: pgsql-hackers
On 2013-04-04 01:52:41 +0200, Andres Freund wrote:
> On 2013-04-03 15:57:49 -0700, Jeff Janes wrote:
> > I've changed the subject from "regression test failed when enabling
> > checksum" because I now know they are totally unrelated.
> >
> > My test case didn't need to depend on archiving being on, and so with a
> > simple tweak I rendered the two issues orthogonal.
> >
> >
> > On Wed, Apr 3, 2013 at 12:15 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> >
> > > On Mon, 2013-04-01 at 19:51 -0700, Jeff Janes wrote:
> > >
> > > > What I would probably really want is the data as it existed after the
> > > > crash but before recovery started, but since the postmaster
> > > > immediately starts recovery after the crash, I don't know of a good
> > > > way to capture this.
> > >
> > > Can you just turn off the restart_after_crash GUC? I had a chance to
> > > look at this, and seeing the block before and after recovery would be
> > > nice. I didn't see a log file in the data directory, but it didn't go
> > > through recovery, so I assume it already did that.
> > >
> >
> > You don't know that the cluster is in the bad state until after it goes
> > through recovery because most crashes recover perfectly fine. So it would
> > have to make a side-copy of the cluster after the crash, then recover the
> > original and see how things go, then either retain or delete the side-copy.
> > Unfortunately my testing harness can't do this at the moment, because the
> > perl script storing the consistency info needs to survive over the crash
> > and recovery. It might take me awhile to figure out how to make it do
> > this.
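[A rough, purely illustrative sketch of that side-copy procedure, not the actual harness: it assumes restart_after_crash = off so the postmaster doesn't start redo on its own after the crash, and the paths below are placeholders.]

    # Snapshot the crashed cluster before recovery runs, then recover the
    # original.  PGDATA is a placeholder path; pg_ctl must be on the PATH.
    import shutil
    import subprocess

    PGDATA = "/path/to/data"            # crashed cluster (placeholder)
    SIDECOPY = PGDATA + ".postcrash"    # untouched post-crash image

    # Copy the data directory before any redo has a chance to run.
    shutil.copytree(PGDATA, SIDECOPY)

    # Now start the original cluster, which performs crash recovery.
    subprocess.check_call(["pg_ctl", "-D", PGDATA, "-w", "start"])

    # After the harness's own consistency checks, either keep SIDECOPY for
    # inspection or discard it:
    #   shutil.rmtree(SIDECOPY)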
> >
> > I have the server log just go to stderr, where it gets mingled together
> > with messages from my testing harness. I had not uploaded that file.
> >
> > Here is a new upload. It contains both a data_dir tarball including xlog,
> > and the mingled stderr ("do_new.out")
> >
> > https://drive.google.com/folderview?id=0Bzqrh1SO9FcEQmVzSjlmdWZvUHc&usp=sharing
> >
> > The other 3 files in it constitute the harness. It is quite a mess, with
> > some hard-coded paths. The just-posted fix for wal_keep_segments will also
> > have to be applied.
> >
> >
> >
> > >
> > > The block is corrupt as far as I can tell. The first third is written,
> > > and the remainder is all zeros. The header looks like this:
> > >
> >
> > Yes, that part is by my design. Why it didn't get fixed from an FPI is not
> > by my design, of course.
>
> There are no FPIs (if you mean a full page image with that) afaics:
>
> Your logs tell us about recovery:
> 27692 2013-04-03 10:09:15.621 PDT:LOG: redo starts at 1/31000090
> 27692 2013-04-03 10:09:15.647 PDT:LOG: incorrect resource manager data checksum in record at 1/31169A68
> 27692 2013-04-03 10:09:15.647 PDT:LOG: redo done at 1/31169A38
>
> And then the error:
>
> 27698 SELECT 2013-04-03 10:09:16.688 PDT:WARNING: page verification failed, calculated checksum 16296 but expected 49517
> 27698 SELECT 2013-04-03 10:09:16.688 PDT:ERROR: invalid page in block 90 of relation base/16384/835589
>
>
> looking at xlog from that time, the first records are:
> rmgr: XLOG len (rec/tot): 72/ 104, tx: 0, lsn: 1/31000028, prev 1/30000090, bkp: 0000, desc: checkpoint: redo 1/31000028; tli 1; prev tli 1; fpw true; xid 0/26254997; oid 835589; multi 1; offset 0; oldest xid 1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0; online
> rmgr: XLOG len (rec/tot): 4/ 36, tx: 0, lsn: 1/31000090, prev 1/31000028, bkp: 0000, desc: nextOid: 843781
> rmgr: Storage len (rec/tot): 16/ 48, tx: 0, lsn: 1/310000B8, prev 1/31000090, bkp: 0000, desc: file create: base/16384/835589
> rmgr: Heap len (rec/tot): 37/ 7905, tx: 26254997, lsn: 1/310000E8, prev 1/310000B8, bkp: 1000, desc: hot_update: rel 1663/16384/12660; tid 0/47 xmax 26254997 ; new tid 0/33 xmax 0
>
> So the file was created after the last checkpoint, and thus no full pages
> need to be logged. And we theoretically should be able to recover all things
> like torn pages from the WAL, since that should cover everything that
> happened to that file.
>
> The bkp block 1 indicated in the 4th record above is the only one in
> that period of WAL btw.
>
> Looking a bit more, the last page that's covered by WAL is:
> rmgr: Heap len (rec/tot): 35/ 67, tx: 26254998, lsn: 1/311699A8, prev 1/31169960, bkp: 0000, desc: insert: rel 1663/16384/835589; tid 44/56
>
> which is far earlier than the "block 90" the error is found in, unless I
> misunderstand something. Which is strange, since non-zero content for
> those pages shouldn't go to disk until the respective WAL is
> flushed. Huh, are we missing something here?
Looking at the page LSNs with dd, I noticed something peculiar:
page 0:
01 00 00 00 18 c2 00 31 => 1/3100C218
page 1:
01 00 00 00 80 44 01 31 => 1/31014480
page 10:
01 00 00 00 60 ce 05 31 => 1/3105ce60
page 43:
01 00 00 00 58 7a 16 31 => 1/31167a58
page 44:
01 00 00 00 f0 99 16 31 => 1/311699f0
page 45:
00 00 00 00 00 00 00 00 => 0/0
page 90:
01 00 00 00 90 17 1d 32 => 1/321d1790
page 91:
01 00 00 00 38 ef 1b 32 => 1/321bef38
So we have written out pages that come after pages without an LSN, and
that have an LSN that's *beyond* the point XLOG has successfully been
written to disk (1/31169A38)?
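[For reference, the same inspection expressed as a small script rather than dd. A sketch only: it assumes the standard 8kB block size and a little-endian host, and hard-codes the relation path and the "redo done" LSN from the log above.]

    # Read pd_lsn (the first 8 bytes of each page header) straight from the
    # relation file and flag pages whose LSN lies past the point redo
    # finished at.
    import struct

    BLCKSZ = 8192
    relfile = "base/16384/835589"          # relation from the error message
    redo_done = (0x1 << 32) | 0x31169A38   # "redo done at 1/31169A38"

    with open(relfile, "rb") as f:
        blkno = 0
        while True:
            page = f.read(BLCKSZ)
            if len(page) < BLCKSZ:
                break
            # pd_lsn is two 32-bit halves: xlogid, then xrecoff
            # (e.g. 01 00 00 00 18 c2 00 31 decodes to 1/3100C218).
            xlogid, xrecoff = struct.unpack_from("<II", page, 0)
            lsn = (xlogid << 32) | xrecoff
            if lsn > redo_done:
                print("page %d: LSN %X/%X is beyond the end of recovered WAL"
                      % (blkno, xlogid, xrecoff))
            blkno += 1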
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services