Very bizarre bug with corrupted index

From: Josh Berkus <josh(at)agliodbs(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Cc: mail(at)joeconway(dot)com
Subject: Very bizarre bug with corrupted index
Date: 2003-05-02 18:08:22
Message-ID: 200305021108.22359.josh@agliodbs.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Folks,

Here's the setup:

The machine: an x86 machine running generic Linux and about 15 server
applications other than PostgreSQL; it is short on both RAM and disk I/O,
resulting in frequent use of swap, which is also slow.

The database: a medium-large (> 100mb) multi-function database including
several large tables (> 200,000 records)

The index: the index in question is on an aggregates-caching table; that is,
when updates are made to one of the largest tables, a trigger (written in C
by Joe Conway) automatically updates category totals in the caching table.
the index is on the category column of the table, which serves as the primary
key.

The story:
-- At 2am, the machine had an unexpected power-out and restarted apparently
without problems (the logs show no significant errors at startup, anyway).
-- At 8pm, a user makes 3 updates to the table launching the triggers. They
get an error message: "ERROR: Index pk_table_aggs is not a btree" What's
really puzzling about the client's report is that the first two updates
appear to have succeeded, and only the 3rd update triggered the error and
failed (according to the client).
-- After trying a number of things and documenting the problem, the client
dropped and rebuilt the corrupted index and things worked normally.

Here are the relevant portions of the database log:

2003-04-21 02:10:17 [2857] DEBUG: pq_recvbuf: unexpected EOF on client
connection
2003-04-21 02:10:38 [2861] DEBUG: pq_recvbuf: unexpected EOF on client
connection
2003-04-21 02:10:38 [2860] DEBUG: pq_recvbuf: unexpected EOF on client
connection
2003-04-21 02:14:38 [1521] DEBUG: database system was interrupted at
2003-04-21 02:07:20 UTC
2003-04-21 02:14:38 [1521] DEBUG: checkpoint record is at 6/76D71AB4
2003-04-21 02:14:38 [1521] DEBUG: redo record is at 6/76D71AB4; undo record
is at 0/0; shutdown FALSE
2003-04-21 02:14:38 [1521] DEBUG: next transaction id: 63994551; next oid:
2377601
2003-04-21 02:14:38 [1521] DEBUG: database system was not properly shut
down; automatic recovery in progress
2003-04-21 02:14:38 [1521] DEBUG: redo starts at 6/76D71AF4
2003-04-21 02:14:40 [1521] DEBUG: ReadRecord: record with zero length at
6/7725D26C
2003-04-21 02:14:40 [1521] DEBUG: redo done at 6/7725D248
2003-04-21 02:14:44 [1521] DEBUG: database system is ready
2003-04-21 02:15:51 [2914] DEBUG: pq_recvbuf: unexpected EOF on client
connection
2003-04-21 02:15:51 [2913] DEBUG: pq_recvbuf: unexpected EOF on client
connection
2003-04-21 02:15:51 [2912] DEBUG: pq_recvbuf: unexpected EOF on client
connection
2003-04-21 02:15:51 [2911] DEBUG: pq_recvbuf: unexpected EOF on client
connection
2003-04-21 02:15:51 [2908] DEBUG: pq_recvbuf: unexpected EOF on client
connection
2003-04-21 02:19:47 [3086] DEBUG: recycled transaction log file
0000000600000076

***

2003-04-21 20:03:08 [2555] ERROR: Index pk_table_aggs is not a btree
2003-04-21 20:03:19 [2828] DEBUG: pq_recvbuf: unexpected EOF on client
connection
2003-04-21 20:03:19 [2825] DEBUG: pq_recvbuf: unexpected EOF on client
connection
2003-04-21 20:03:19 [2822] DEBUG: pq_recvbuf: unexpected EOF on client
connection
2003-04-21 20:03:19 [2818] DEBUG: pq_recvbuf: unexpected EOF on client
connection
2003-04-21 20:03:19 [2555] DEBUG: pq_recvbuf: unexpected EOF on client
connection
2003-04-21 20:03:20 [18330] ERROR: Index pk_table_aggs is not a btree

Copies of the actual corrupted index are available. An examination of them
shows something really bizarre; the entire index is zero-filled! It all
looks like this:

*****************************************************************
* PostgreSQL File/Block Formatted Dump Utility
*
* File: 2369452
* Options used: -idx -S 8192
*
* Dump created on: Thu May 1 19:57:15 2003
*****************************************************************

Block 0 ******************************************************
0000: 00000000 00000000 00000000 00000000 ................
0010: 00000000 00000000 00000000 00000000 ................
0020: 00000000 00000000 00000000 00000000 ................
0030: 00000000 00000000 00000000 00000000 ................
0040: 00000000 00000000 00000000 00000000 ................
0050: 00000000 00000000 00000000 00000000 ................
0060: 00000000 00000000 00000000 00000000 ................
0070: 00000000 00000000 00000000 00000000 ................
0080: 00000000 00000000 00000000 00000000 ................
0090: 00000000 00000000 00000000 00000000 ................
00a0: 00000000 00000000 00000000 00000000 ................
00b0: 00000000 00000000 00000000 00000000 ................
00c0: 00000000 00000000 00000000 00000000 ................
00d0: 00000000 00000000 00000000 00000000 ................
00e0: 00000000 00000000 00000000 00000000 ................
00f0: 00000000 00000000 00000000 00000000 ................
0100: 00000000 00000000 00000000 00000000 ................
0110: 00000000 00000000 00000000 00000000 ................
0120: 00000000 00000000 00000000 00000000 ................
0130: 00000000 00000000 00000000 00000000 ................
0140: 00000000 00000000 00000000 00000000 ................
0150: 00000000 00000000 00000000 00000000 ................
0160: 00000000 00000000 00000000 00000000 ................
0170: 00000000 00000000 00000000 00000000 ...............
etc.

Questions:

1) Is this problem likely to be related to the unexpected power-out, or
unrelated?
2) What could cause even a corrupted index to be zero-filled?
3) Could this problem be due to a memory error since the client is using
non-ECC RAM? If so, how likely is it and can the chances be reduced?
4) Most importantly: Is this a Postgres bug that needs patching?

--
Josh Berkus
Aglio Database Solutions
San Francisco

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Tom Lane 2003-05-02 19:25:16 Re: Very bizarre bug with corrupted index
Previous Message pgsql-bugs 2003-05-02 16:13:52 Bug #963: Weird plpython trigger(?) behaviour in 7.3.2