Quick Links

Re: corrupt pages detected by enabling checksums

From:	Andres Freund <andres(at)2ndquadrant(dot)com>
To:	Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc:	Jeff Davis <pgsql(at)j-davis(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: corrupt pages detected by enabling checksums
Date:	2013-04-04 00:39:25
Message-ID:	20130404003925.GC19178@awork2.anarazel.de
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 2013-04-04 02:28:32 +0200, Andres Freund wrote:
> On 2013-04-04 01:52:41 +0200, Andres Freund wrote:
> > On 2013-04-03 15:57:49 -0700, Jeff Janes wrote:
> > > I've changed the subject from "regression test failed when enabling
> > > checksum" because I now know they are totally unrelated.
> > >
> > > My test case didn't need to depend on archiving being on, and so with a
> > > simple tweak I rendered the two issues orthogonal.
> > >
> > >
> > > On Wed, Apr 3, 2013 at 12:15 PM, Jeff Davis <pgsql(at)j-davis(dot)com> wrote:
> > >
> > > > On Mon, 2013-04-01 at 19:51 -0700, Jeff Janes wrote:
> > > >
> > > > > What I would probably really want is the data as it existed after the
> > > > > crash but before recovery started, but since the postmaster
> > > > > immediately starts recovery after the crash, I don't know of a good
> > > > > way to capture this.
> > > >
> > > > Can you just turn off the restart_after_crash GUC? I had a chance to
> > > > look at this, and seeing the block before and after recovery would be
> > > > nice. I didn't see a log file in the data directory, but it didn't go
> > > > through recovery, so I assume it already did that.
> > > >
> > >
> > > You don't know that the cluster is in the bad state until after it goes
> > > through recovery because most crashes recover perfectly fine. So it would
> > > have to make a side-copy of the cluster after the crash, then recover the
> > > original and see how things go, then either retain or delete the side-copy.
> > > Unfortunately my testing harness can't do this at the moment, because the
> > > perl script storing the consistency info needs to survive over the crash
> > > and recovery. It might take me awhile to figure out how to make it do
> > > this.
> > >
> > > I have the server log just go to stderr, where it gets mingled together
> > > with messages from my testing harness. I had not uploaded that file.
> > >
> > > Here is a new upload. It contains both a data_dir tarball including xlog,
> > > and the mingled stderr ("do_new.out")
> > >
> > > https://drive.google.com/folderview?id=0Bzqrh1SO9FcEQmVzSjlmdWZvUHc&usp=sharing
> > >
> > > The other 3 files in it constitute the harness. It is a quite a mess, with
> > > some hard-coded paths. The just-posted fix for wal_keep_segments will also
> > > have to be applied.
> > >
> > >
> > >
> > > >
> > > > The block is corrupt as far as I can tell. The first third is written,
> > > > and the remainder is all zeros. The header looks like this:
> > > >
> > >
> > > Yes, that part is by my design. Why it didn't get fixed from a FPI is not
> > > by my design, or course.
> >
> > There are no FPIs (if you mean a full page image with that) afaics:
> >
> > Your logs tell us about recovery:
> > 27692 2013-04-03 10:09:15.621 PDT:LOG: redo starts at 1/31000090
> > 27692 2013-04-03 10:09:15.647 PDT:LOG: incorrect resource manager data checksum in record at 1/31169A68
> > 27692 2013-04-03 10:09:15.647 PDT:LOG: redo done at 1/31169A38

> Looking at the page lsn's with dd I noticed something peculiar:
>
> page 0:
> 01 00 00 00 18 c2 00 31 => 1/3100C218
> page 1:
> 01 00 00 00 80 44 01 31 => 1/31014480
> page 10:
> 01 00 00 00 60 ce 05 31 => 1/3105ce60
> page 43:
> 01 00 00 00 58 7a 16 31 => 1/31167a58
> page 44:
> 01 00 00 00 f0 99 16 31 => 1/311699f0
> page 45:
> 00 00 00 00 00 00 00 00 => 0/0
> page 90:
> 01 00 00 00 90 17 1d 32 => 1/321d1790
> page 91:
> 01 00 00 00 38 ef 1b 32 => 1/321bef38
>
> So we have written out pages that are after pages without a LSN that
> have an LSN thats *beyond* the point XLOG has successfully been written
> to disk (1/31169A38)?

By now I am pretty sure the issue is that the end-of-wal is detected too
early. Above the end is detected at 1/31169A38, but the next WAL file
contains valid data:
pg_xlogdump /tmp/tmp/data2/pg_xlog/000000010000000100000032 -n 10
rmgr: Heap2 len (rec/tot): 26/ 58, tx: 0, lsn: 1/32000030, prev 1/31FFFFD8, bkp: 0000, desc: clean: rel 1663/16384/835589; blk 122 remxid 26328926
rmgr: Heap len (rec/tot): 51/ 83, tx: 26328964, lsn: 1/32000070, prev 1/32000030, bkp: 0000, desc: update: rel 1663/16384/835589; tid 122/191 xmax 26328964 ; new tid 129/140 xmax 0
rmgr: Heap2 len (rec/tot): 30/ 62, tx: 0, lsn: 1/320000C8, prev 1/32000070, bkp: 0000, desc: clean: rel 1663/16384/835589; blk 63 remxid 26328803
rmgr: Btree len (rec/tot): 34/ 66, tx: 26328964, lsn: 1/32000108, prev 1/320000C8, bkp: 0000, desc: insert: rel 1663/16384/835590; tid 25/14

That would explains exactly those symptopms, wouldn't it? Whats causing
that - I am not sure yet.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Re: corrupt pages detected by enabling checksums at 2013-04-04 00:28:32 from Andres Freund

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Tom Lane	2013-04-04 00:41:39	Re: Page replacement algorithm in buffer cache
Previous Message	Noah Misch	2013-04-04 00:32:56	Re: Regex with > 32k different chars causes a backend crash