Re: 9.4 checksum errors in recovery with gin index

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: 9.4 checksum errors in recovery with gin index
Date: 2014-05-07 17:34:21
Message-ID: 20140507173421.GJ13397@awork2.anarazel.de
Lists: pgsql-hackers

Hi,

On 2014-05-07 10:21:26 -0700, Jeff Janes wrote:
> On Wed, May 7, 2014 at 12:48 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>
> > Hi,
> >
> > On 2014-05-07 00:35:35 -0700, Jeff Janes wrote:
> > > When recovering from a crash (with injection of a partial page write at
> > > time of crash) against 7c7b1f4ae5ea3b1b113682d4d I get a checksum
> > > verification failure.
> > >
> > > 16396 is a gin index.
> >
> > Over which type? What was the load? make check?
> >
>
> A gin index on text[].
>
> The load is a variation of the crash recovery tester I've been using the
> last few years, this time adapted to use a gin index in a rather unnatural
> way. I just increment a counter on a random row repeatedly via a unique
> key, but for this purpose that unique key is stuffed into text[] along with
> a bunch of cruft. The cruft is text representations of negative integers,
> the actual key is text representation of nonnegative integers.
>
> The test harness (patch to induce crashes, and two driving programs) and a
> preserved data directory are here:
>
> https://drive.google.com/folderview?id=0Bzqrh1SO9FcESDZVeFk5djJaeHM&usp=sharing
>
> (role jjanes, database jjanes)
>
> As far as I can tell, this problem goes back to the beginning of page
> checksums.

Interesting.

> > > If I have it ignore checksum failures, there is no apparent misbehavior.
> > > I'm trying to bisect it, but it could take a while and I thought someone
> > > might have some theories based on the log:
> >
> > If you have the WAL, pg_xlogdump output grepped for everything referring
> > to that block would be helpful.
> >
>
> The only record which mentions block 28486 by name is this one:

Hm, try running pg_xlogdump with -b specified, so that the backup blocks
attached to each record are listed as well.
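
For illustration only, here is a rough Python sketch of the kind of filtering
meant here: it groups each pg_xlogdump record with any indented detail lines
that follow it and keeps the records that mention a given relfilenode and
block. The function name is hypothetical, and the exact layout of the detail
lines printed with -b ("rel .../.../...; ... block: N") is an assumption on my
part, so the patterns may need adjusting against real output.

```python
import re


def records_touching_block(lines, relnode, blkno):
    """Yield pg_xlogdump records (as lists of lines) that mention the block.

    Assumptions: each record starts with a line beginning "rmgr:", detail
    lines for backup blocks follow it, and both kinds of line name the
    relation as spc/db/relnode and the block as "blkno: N" or "block: N".
    """
    rel_pat = re.compile(r"\b\d+/\d+/%d\b" % relnode)
    blk_pat = re.compile(r"\b(?:blkno|block):? *%d\b" % blkno)

    def matches(group):
        # A record qualifies if any single line names both relation and block.
        return any(rel_pat.search(l) and blk_pat.search(l) for l in group)

    group = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("rmgr:") and group:
            # A new record header closes the previous record.
            if matches(group):
                yield group
            group = []
        if line.strip():
            group.append(line)
    if group and matches(group):
        yield group


# Possible use, feeding it pg_xlogdump output on stdin:
#   import sys
#   for rec in records_touching_block(sys.stdin, 16396, 28486):
#       print("\n".join(rec))
```

Grouping by record rather than grepping single lines keeps the record header
(with its lsn) next to any backup-block line that matched.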

> rmgr: Gin len (rec/tot): 1576/ 1608, tx: 77882205, lsn:
> 11/30F4C2C0, prev 11/30F4C290, bkp: 0000, desc: Insert new list page, node:
> 1663/16384/16396 blkno: 28486
>
> However, I think that that record precedes the recovery start point.

If that's the case, it seems likely that a PageSetLSN() or PageSetDirty()
call is missing somewhere...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
