From: | Andres Freund <andres(at)2ndquadrant(dot)com> |
---|---|
To: | Simon Riggs <simon(at)2ndQuadrant(dot)com> |
Cc: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: corrupt pages detected by enabling checksums |
Date: | 2013-04-04 14:53:27 |
Message-ID: | 20130404145327.GA4382@awork2.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 2013-04-04 13:30:40 +0100, Simon Riggs wrote:
> On 4 April 2013 02:39, Andres Freund <andres(at)2ndquadrant(dot)com> wrote:
>
> > Ok, I think I see the bug. And I think its been introduced in the
> > checkpoints patch.
>
> Well spotted. (I think you mean checksums patch).
Heh, yes. I was slightly tired at that point ;)
> > If by now the first backend has proceeded to PageSetLSN() we are writing
> > different data to disk than the one we computed the checksum of
> > before. Boom.
>
> Right, so nothing else we were doing was wrong, that's why we couldn't
> spot a bug. The problem is that we aren't replaying enough WAL because
> the checksum on the WAL record is broke.
Well, there might be other bugs we can't see yet... But lets hope not.
> > I think the whole locking interactions in MarkBufferDirtyHint() need to
> > be thought over pretty carefully.
>
> When we write out a buffer with checksums enabled, we take a copy of
> the buffer so that the checksum is consistent, even while other
> backends may be writing hints to the same bufer.
>
> I missed out on doing that with XLOG_HINT records, so the WAL CRC can
> be incorrect because the data is scanned twice; normally that would be
> OK because we have an exclusive lock on the block, but with hints we
> only have share lock. So what we need to do is take a copy of the
> buffer before we do XLogInsert().
>
> Simple patch to do this attached for discussion. (Not tested).
>
> We might also do this by modifying the WAL record to take the whole
> block and bypass the BkpBlock mechanism entirely. But that's more work
> and doesn't seem like it would be any cleaner. I figure lets solve the
> problem first then discuss which approach is best.
Unfortunately I find that approach unacceptably ugly. I don't think its
ok to make an already *very* complicated function (XLogInsert()) in one
of the most complicated files (xlog.c) even more complicated. It
literally took me years to understand a large percentage of it.
I even think the XLOG_HINT escape hatch should be taken out and be
replaced by something outside of XLogInsert().
Don't think that approach would work with as few lines anyway, since you
need to properly pass it into XLogCheckBuffer() et al which checks the
buffer tags and such - which it needs to properly compute the struct
BkpBlock. Also we iterate over rdata->page for the CRC computation.
I think the route you quickly sketched is more realistic. That would
remove all knowledge obout XLOG_HINT from generic code hich is a very
good thing, I spent like 15minutes yesterday wondering whether the early
return in there might be the cause of the bug...
Something like:
XLogRecPtr
XLogSaveBufferForHint(Buffer buffer)
{
XLogRecPtr recptr = InvalidXLogRecPtr;
XLogRecPtr lsn;
XLogRecData rdata[2];
BkbBlock bkp;
/*
* make sure no checkpoint can happen while we decide about logging
* something or not, since we don't hold any lock preventing that
* otherwise.
*/
Assert(MyPgXact->delayChkpt);
/* update RedoRecPtr */
GetRedoRecPtr();
/* setup phony rdata element */
rdata[0].buffer = buffer;
rdata[0].buffer_std = true; /* is that actually ok? */
/* force full pages on, otherwise checksums won't work? */
if (XLogCheckBuffer(rdata, true, &lsn, &bkp))
{
char copied_buffer[BLCKSZ];
/*
* copy buffer so we don't have to worry about
* concurrent hint bit or lsn updates. We assume pd_lower/upper
* cannot be changed without an exclusive lock, so the contents
* bkp are not racy.
*/
memcpy(copied_buffer, BufferGetBlock(buffer), BLCKSZ);
rdata[0].data = bkp;
rdata[0].len = sizeof(BkbBlock);
rdata[0].buffer = InvalidBuffer;
rdata[0].buffer_std = false;
rdata[0].next = &(rdata[1]);
rdata[1].data = copied_buffer;
rdata[1].len = BLCKSZ;
rdata[1].buffer = InvalidBuffer;
rdata[1].buffer_std = false;
rdata[1].next = NULL;
recptr = XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata);
}
return recptr;
}
That should work?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From | Date | Subject | |
---|---|---|---|
Next Message | Rodrigo Barboza | 2013-04-04 14:59:52 | Why there is a PG_GETARG_UINT32 and PG_RETURN_UINT32? |
Previous Message | Heikki Linnakangas | 2013-04-04 13:01:24 | Number of spinlocks |