From: | Andres Freund <andres(at)2ndquadrant(dot)com> |
---|---|
To: | Jeff Janes <jeff(dot)janes(at)gmail(dot)com> |
Cc: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: 9.4 checksum errors in recovery with gin index |
Date: | 2014-05-07 17:34:21 |
Message-ID: | 20140507173421.GJ13397@awork2.anarazel.de |
Lists: | pgsql-hackers |
Hi,
On 2014-05-07 10:21:26 -0700, Jeff Janes wrote:
> On Wed, May 7, 2014 at 12:48 AM, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>
> > Hi,
> >
> > On 2014-05-07 00:35:35 -0700, Jeff Janes wrote:
> > > When recovering from a crash (with injection of a partial page write at
> > > time of crash) against 7c7b1f4ae5ea3b1b113682d4d I get a checksum
> > > verification failure.
> > >
> > > 16396 is a gin index.
> >
> > Over which type? What was the load? make check?
> >
>
> A gin index on text[].
>
> The load is a variation of the crash recovery tester I've been using the
> last few years, this time adapted to use a gin index in a rather unnatural
> way. I just increment a counter on a random row repeatedly via a unique
> key, but for this purpose that unique key is stuffed into text[] along with
> a bunch of cruft. The cruft is text representations of negative integers,
> the actual key is text representation of nonnegative integers.
>
> The test harness (patch to induce crashes, and two driving programs) and a
> preserved data directory are here:
>
> https://drive.google.com/folderview?id=0Bzqrh1SO9FcESDZVeFk5djJaeHM&usp=sharing
>
> (role jjanes, database jjanes)
>
> As far as I can tell, this problem goes back to the beginning of page
> checksums.
Interesting.
> > > If I have it ignore checksum failures, there is no apparent misbehavior.
> > > I'm trying to bisect it, but it could take a while and I thought someone
> > > might have some theories based on the log:
> >
> > If you have the WAL a pg_xlogdump grepping for everything referring to
> > that block would be helpful.
> >
>
> The only record which mentions block 28486 by name is this one:
Hm, try running it with -b specified.
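(A minimal sketch of the filtering step, not a definitive tool: it reads pg_xlogdump output and keeps only the lines that mention a given relation and block. The "blkno: N" form is taken from the record quoted in this thread; the "block: N" form for the extra lines printed with -b is an assumption and should be checked against the actual output. The function names are hypothetical.)

```python
import re
import sys


def make_matchers(relnode, blkno):
    """Build regexes for a relfilenode triple and a block number."""
    # Relation as printed in the dump, e.g. "1663/16384/16396"; the
    # lookarounds stop it matching inside a longer number or path.
    rel_re = re.compile(r"(?<![\d/])" + re.escape(relnode) + r"(?!\d)")
    # "blkno: N" appears in the record quoted in this thread.
    # "block: N" for backup-block lines is an ASSUMPTION about -b output.
    blk_re = re.compile(r"\b(?:blkno|block):?\s*%d\b" % blkno)
    return rel_re, blk_re


def filter_records(lines, relnode, blkno):
    """Yield dump lines that mention both the relation and the block."""
    rel_re, blk_re = make_matchers(relnode, blkno)
    for line in lines:
        if rel_re.search(line) and blk_re.search(line):
            yield line.rstrip("\n")


if __name__ == "__main__" and len(sys.argv) == 3:
    # Reads the dump from stdin; arguments are the relation and block.
    for record in filter_records(sys.stdin, sys.argv[1], int(sys.argv[2])):
        print(record)
```

(Possible usage, with the script name and WAL directory as placeholders: pg_xlogdump -b -p <pg_xlog dir> <startseg> | python filter_block.py 1663/16384/16396 28486)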
> rmgr: Gin len (rec/tot): 1576/ 1608, tx: 77882205, lsn:
> 11/30F4C2C0, prev 11/30F4C290, bkp: 0000, desc: Insert new list page, node:
> 1663/16384/16396 blkno: 28486
>
> However, I think that that record precedes the recovery start point.
If that's the case, it seems likely that a PageSetLSN() or PageSetDirty()
call is missing somewhere...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services