Re: corrupt pages detected by enabling checksums

From: Andres Freund <andres(at)2ndquadrant(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Jeff Davis <pgsql(at)j-davis(dot)com>, Fujii Masao <masao(dot)fujii(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: corrupt pages detected by enabling checksums
Date: 2013-04-04 01:39:00
Message-ID: 20130404013900.GE19178@awork2.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 2013-04-04 02:58:43 +0200, Andres Freund wrote:
> On 2013-04-03 20:45:51 -0400, Tom Lane wrote:
> > andres(at)anarazel(dot)de (Andres Freund) writes:
> > > Looking at the page lsn's with dd I noticed something peculiar:
> >
> > > page 0:
> > > 01 00 00 00 18 c2 00 31 => 1/3100C218
> > > page 1:
> > > 01 00 00 00 80 44 01 31 => 1/31014480
> > > page 10:
> > > 01 00 00 00 60 ce 05 31 => 1/3105ce60
> > > page 43:
> > > 01 00 00 00 58 7a 16 31 => 1/31167a58
> > > page 44:
> > > 01 00 00 00 f0 99 16 31 => 1/311699f0
> > > page 45:
> > > 00 00 00 00 00 00 00 00 => 0/0
> > > page 90:
> > > 01 00 00 00 90 17 1d 32 => 1/321d1790
> > > page 91:
> > > 01 00 00 00 38 ef 1b 32 => 1/321bef38
> >
> > > So we have written out pages that are after pages without a LSN that
> > > have an LSN thats *beyond* the point XLOG has successfully been written
> > > to disk (1/31169A38)?
> >
> > If you're looking into the FPIs, those would contain the page's older
> > LSN, not the one assigned by the current WAL record.
>
> Nope, thats from the heap, and the LSNs are *newer* than what startup
> recovered to. I am pretty sure by now we are missing out on valid WAL, I
> am just not sure why.

Ok, I think I see the bug. And I think its been introduced in the
checkpoints patch.

Remember, as explained in other mails, I am pretty sure the end of wal
was errorneously detected, resulting in the following error:
27692 2013-04-03 10:09:15.647 PDT:LOG: incorrect resource manager data checksum in record at 1/31169A68

Not that without a *hardware* crash end of wal is normally found first
by the checks for wrong LSNs or wrong record lengths. Not here.

My theory is that a page that was full page written changed while it was
inserted into the WAL which caused the error. A possibly scenario for
that would e.g. two concurrent calls to MarkBufferDirtyHint().
Note:
* 2. The caller might have only share-lock instead of exclusive-lock on the
* buffer's content lock.

if ((bufHdr->flags & (BM_DIRTY | BM_JUST_DIRTIED)) !=
(BM_DIRTY | BM_JUST_DIRTIED))
{
if (DataChecksumsEnabled() && (bufHdr->flags & BM_PERMANENT))
{
MyPgXact->delayChkpt = delayChkpt = true;
lsn = XLogSaveBufferForHint(buffer); /* PART 1 */
}
}

LockBufHdr(bufHdr);
Assert(bufHdr->refcount > 0);

if (!(bufHdr->flags & BM_DIRTY))
{
dirtied = true; /* Means "will be dirtied by this action" */

/*
* Set the page LSN if we wrote a backup block. We aren't
* supposed to set this when only holding a share lock but
* as long as we serialise it somehow we're OK. We choose to
* set LSN while holding the buffer header lock, which causes
* any reader of an LSN who holds only a share lock to also
* obtain a buffer header lock before using PageGetLSN().
* Fortunately, thats not too many places.
*
* If checksums are enabled, you might think we should reset the
* checksum here. That will happen when the page is written
* sometime later in this checkpoint cycle.
*/
if (!XLogRecPtrIsInvalid(lsn))
PageSetLSN(page, lsn); /* PART 2 */
}

So XLogSaveBufferForHint(buffer) is called for all buffers which are not
yet dirty. Without any locks. It just does:

XLogSaveBufferForHint(Buffer buffer)
{
rdata[0].data = (char *) (&watermark);
rdata[0].len = sizeof(int);
rdata[0].next = &(rdata[1]);

rdata[1].data = NULL;
rdata[1].len = 0;
rdata[1].buffer = buffer;
rdata[1].buffer_std = true;
rdata[1].next = NULL;

return XLogInsert(RM_XLOG_ID, XLOG_HINT, rdata);
}

I think what happens is that that one backend grabs WALInsertLock first,
it computes a full page write for the buffer, including a CRC. And
returns a valid LSN. Then it continues towards the PageSetLSN() after
LockBufHdr().
Allthewhile the second backend notices that WALInsertLock is free again
and continues towards the WALInsertLock protected parts of XLogInsert().

Only that it already has done that part:
/*
* Calculate CRC of the data, including all the backup blocks
*
* Note that the record header isn't added into the CRC initially since we
* don't know the prev-link yet. Thus, the CRC will represent the CRC of
* the whole record in the order: rdata, then backup blocks, then record
* header.
*/
INIT_CRC32(rdata_crc);
for (rdt = rdata; rdt != NULL; rdt = rdt->next)
COMP_CRC32(rdata_crc, rdt->data, rdt->len);

It goes on to write all the data to the WAL after successfully getting
WALInsertLock:
while (write_len)
{
while (rdata->data == NULL)
rdata = rdata->next;

if (freespace > 0)
{
if (rdata->len > freespace)
{
memcpy(Insert->currpos, rdata->data, freespace);
rdata->data += freespace;
rdata->len -= freespace;
write_len -= freespace;
}
else
{
memcpy(Insert->currpos, rdata->data, rdata->len);
freespace -= rdata->len;
write_len -= rdata->len;
Insert->currpos += rdata->len;
rdata = rdata->next;
continue;
}
}

/* Use next buffer */
updrqst = AdvanceXLInsertBuffer(false);
curridx = Insert->curridx;
/* Mark page header to indicate this record continues on the page */
Insert->currpage->xlp_info |= XLP_FIRST_IS_CONTRECORD;
Insert->currpage->xlp_rem_len = write_len;
freespace = INSERT_FREESPACE(Insert);
}

If by now the first backend has proceeded to PageSetLSN() we are writing
different data to disk than the one we computed the checksum of
before. Boom.

I think the whole locking interactions in MarkBufferDirtyHint() need to
be thought over pretty carefully.

Does this explanation make sense?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Greg Stark 2013-04-04 01:40:50 Re: Multi-pass planner
Previous Message Ants Aasma 2013-04-04 01:37:33 Re: Page replacement algorithm in buffer cache