Invalid headers and xlog flush failures

From: Bricklen Anderson <BAnderson(at)PresiNET(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Invalid headers and xlog flush failures
Date: 2005-02-02 16:45:24
Message-ID: 420103A4.5060104@PresiNET.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi all,

I recently came across some apparent corruption in one of our databases around a month ago.
version: postgresql 8 (originally 8r3, now at 8.0.1), debian box

The messages that we were originally getting in our syslog were about invalid page headers. After
googling around, then dumping the page with pg_filedump, I decided to drop and recreate the affected
table. This seemed to work for a while, until this message cropped up in the syslog during a heavy load:

Feb 1 11:17:49 dev94 postgres[4959]: [470-2] 2005-02-01 11:17:49 PST> CONTEXT: writing block 47272
of relation 1663/17235/57800
Feb 1 11:17:49 dev94 postgres[4959]: [471-1] 2005-02-01 11:17:49 PST> WARNING: could not write
block 47272 of 1663/17235/57800
Feb 1 11:17:49 dev94 postgres[4959]: [471-2] 2005-02-01 11:17:49 PST> DETAIL: Multiple failures
--- write error may be permanent.
Feb 1 11:17:50 dev94 postgres[4959]: [472-1] 2005-02-01 11:17:50 PST> ERROR: xlog flush request
972/FC932854 is not satisfied --- flushed only to 73/86D2640

This maps to an index. I reindexed it (and several other tables), and a 3 hours later, restarted my
load process. Shortly after that, the same thing happened again (with different numbers this time):

Feb 1 14:36:05 dev94 postgres[12887]: [626-2] 2005-02-01 14:36:05 PST> CONTEXT: writing block 7502
of relation 1663/17235/151565
Feb 1 14:36:05 dev94 postgres[12887]: [627-1] 2005-02-01 14:36:05 PST> WARNING: could not write
block 7502 of 1663/17235/151565
Feb 1 14:36:05 dev94 postgres[12887]: [627-2] 2005-02-01 14:36:05 PST> DETAIL: Multiple failures
--- write error may be permanent.
Feb 1 14:36:06 dev94 postgres[12887]: [628-1] 2005-02-01 14:36:06 PST> ERROR: xlog flush request
973/3EF36C2C is not satisfied --- flushed only to 73/419878B4

Both sets are repeated continuously through the syslog.

I pursued some references to XID wraparound, but that didn't seem likely from what I could see
(unless I'm misreading the numbers)
SELECT datname, age(datfrozenxid) FROM pg_database where datname='dev17';
datname | age
---------+------------
dev17 | 1074008776

Here is a pg_filedump of 151565:
$pg_filedump -i -f -R 7502 /var/postgres/data/base/17235/151565

*******************************************************************
* PostgreSQL File/Block Formatted Dump Utility - Version 3.0
*
* File: /var/postgres/data/base/17235/151565
* Options used: -i -f -R 7502
*
* Dump created on: Tue Feb 1 14:34:14 2005
*******************************************************************

Block 7502 ********************************************************
<Header> -----
Block Offset: 0x03a9c000 Offsets: Lower 988 (0x03dc)
Block: Size 8192 Version 2 Upper 3336 (0x0d08)
LSN: logid 115 recoff 0x39e855f4 Special 8176 (0x1ff0)
Items: 242 Free Space: 2348
Length (including item array): 992

Error: Invalid header information.

0000: 73000000 f455e839 01000000 dc03080d s....U.9........
0010: f01f0220 cc912800 e0912800 f4912800 ... ..(...(...(.
0020: 08922800 1c922800 30922800 44922800 ..(...(.0.(.D.(.
<snipped>

<Data> ------
Item 1 -- Length: 20 Offset: 4556 (0x11cc) Flags: USED
Block Id: 9016 linp Index: 2 Size: 20
Has Nulls: 0 Has Varwidths: 16384

11cc: 00003823 02001440 0b000000 022000cf (dot)(dot)8#(dot)(dot)(dot)(at)(dot)(dot)(dot)(dot)(dot) ..
11dc: 66f06500 f.e.

Item 2 -- Length: 20 Offset: 4576 (0x11e0) Flags: USED
Block Id: 9571 linp Index: 8 Size: 20
Has Nulls: 0 Has Varwidths: 16384

11e0: 00006325 08001440 0b000000 022000cf (dot)(dot)c%(dot)(dot)(dot)(at)(dot)(dot)(dot)(dot)(dot) ..
11f0: 66f06400 f.d.

Item 3 -- Length: 20 Offset: 4596 (0x11f4) Flags: USED
Block Id: 9571 linp Index: 3 Size: 20
Has Nulls: 0 Has Varwidths: 16384

11f4: 00006325 03001440 0b000000 022000cf (dot)(dot)c%(dot)(dot)(dot)(at)(dot)(dot)(dot)(dot)(dot) ..
1204: 66f06400 f.d.
<snipped>

<Special Section> -----
BTree Index Section:
Flags: 0x0001 (LEAF)
Blocks: Previous (1314) Next (1958) Level (0)

1ff0: 22050000 a6070000 00000000 01000000 "...............

*** End of Requested Range Encountered. Last Block Read: 7502 ***

Can anyone suggest what I should try next, or if you need more information, I'll happily supply what
I can.

Inline are the changes I made to pg_filedump to get it to compile and work, as such I can't attest
to pg_filedump's accuracy (which could be the source of those invalid header messages)
#########################################
--- pg_filedump.c.old 2004-02-23 12:58:58.000000000 -0800
+++ ../pg_filedump-3.0/pg_filedump.c.new 2005-01-31 09:24:36.000000000
-0800
@@ -742,8 +742,8 @@

printf (" XID: min (%u) CMIN|XMAX: %u CMAX|XVAC: %u\n"
" Block Id: %u linp Index: %u Attributes: %d Size: %d\n",
- htup->t_xmin, htup->t_field2.t_cmin,
- htup->t_field3.t_cmax,
+ htup->t_choice.t_heap.t_xmin,htup->t_choice.t_heap.t_cmin,
+ htup->t_choice.t_heap.t_field4.t_cmax,
((uint32)
((htup->t_ctid.ip_blkid.bi_hi << 16) | (uint16) htup->t_ctid.ip_blkid.bi_lo)),
htup->t_ctid.ip_posid,
@@ -768,8 +768,8 @@
strcat (flagString, "HASCOMPRESSED|");
if (infoMask & HEAP_HASOID)
strcat (flagString, "HASOID|");
- if (infoMask & HEAP_XMAX_IS_XMIN)
- strcat (flagString, "XMAX_IS_XMIN|");
+ //if (infoMask & HEAP_XMAX_IS_XMIN)
+ // strcat (flagString, "XMAX_IS_XMIN|");
if (infoMask & HEAP_XMAX_UNLOGGED)
strcat (flagString, "XMAX_UNLOGGED|");
if (infoMask & HEAP_XMIN_COMMITTED)
@@ -1042,7 +1042,7 @@
" Previous Checkpoint Record: Log File (%u) Offset (0x%08x)\n"
" Last Checkpoint Record Redo: Log File (%u) Offset (0x%08x)\n"
" |- Undo: Log File (%u) Offset (0x%08x)\n"
- " |- StartUp ID: %u\n"
+ //" |- StartUp ID: %u\n"
" |- Next XID: %u\n"
" |- Next OID: %u\n"
" - Time: %s"
@@ -1064,7 +1064,7 @@
controlData->prevCheckPoint.xlogid,
controlData->prevCheckPoint.xrecoff,checkPoint->redo.xlogid,
checkPoint->redo.xrecoff, checkPoint->undo.xlogid,
- checkPoint->undo.xrecoff, checkPoint->ThisStartUpID,
+ checkPoint->undo.xrecoff, /*checkPoint->ThisStartUpID,*/
checkPoint->nextXid, checkPoint->nextOid,
ctime (&checkPoint->time), controlData->blcksz,
controlData->relseg_size, controlData->nameDataLen,
#########################################

Obviously, the bottom 3 chunks aren't very useful, but a cursory examination of the relevant headers
didn't make it obvious what these should be changed to.

Sorry for the long-winded email, and thanks for taking the time to read this far!

Cheers,

Bricklen

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Tom Lane 2005-02-02 16:49:18 Re: basic pg lock question
Previous Message Andreas Duffner 2005-02-02 16:39:52 Re: NewsServer down ? (PostgreSql.org: trusted sender for your account)