Debugging a postgresql server crash.

From: Clemens Wehrmann <clemens(dot)wehrmann(at)ciao-de(dot)com>
To: pgsql-general(at)postgresql(dot)org
Subject: Debugging a postgresql server crash.
Date: 2005-02-18 17:13:27
Message-ID: 200502181813.27714.clemens.wehrmann@ciao-de.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

I am trying to debug a postgresql server crash and am out of my depth. The
server is sent a segmentation fault when accessing certain data. I've
tracked down one such place (though there may be others). I'm wondering if
its possible for a more experienced person to see from the below if I'm
looking at table corruption. pg_dump also crashes the server. The
filesystem purports to be consistent and I haven't yet found bad blocks.

Thanks much to anyone who reads on and the many whose questions and answers
have been so helpful to this lurker in the past.

Cheers,
--clemens

======
Set up:
======
FreeBSD hog6 5.3-RELEASE
dual amd64
amr(RAID10) MegaRAID
postgresql 7.4.5 from ports

=======
The table
=======
catfang_de_broken=# \d imp_product
Table "public.imp_product"
Column | Type | Modifiers
-------------+---------+-------------------------------------------------------------
product_id | integer | not null default
(nextval('product_id_seq'::text))::integer
category_id | integer | not null
name | text | not null
created | date | not null
current | boolean |
active | boolean | not null default true
action | integer | not null
Indexes:
"imp_product_pkey" primary key, btree (product_id)
"imp_product_action_idx" btree ("action")
"imp_product_category_idx" btree (category_id)
Check constraints:
"imp_product_category_id" CHECK (category_id <> 0)
"imp_product_product_id" CHECK (product_id <> 0)
Foreign-key constraints:
"$1" FOREIGN KEY (category_id) REFERENCES imp_category(category_id)

=====================================================
Selecting works for records where offset < 129833, but is broken for the next
record.
=====================================================
catfang_de_broken=# select * from imp_product offset 129832 limit 1;
product_id | category_id | name | created | current | active | action
------------+-------------+----------+------------+---------+--------+--------
761377 | 11996 | KGN 7060 | 2000-11-21 | | t | 0
(1 row)
catfang_de_broken=# select * from imp_product offset 129831 limit 1;
FATAL: terminating connection due to administrator command
server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.

======================================================
The same two queries from the log. No error messages prior to the crash.
======================================================
Feb 18 13:02:11 hog6 postgres[94548]: [130033-3] STATEMENT: select * from
imp_product offset 129832 limit 1;
Feb 18 13:02:11 hog6 postgres[94548]: [130034-1] LOG: 00000: statement:
select * from imp_product offset 129833 limit 1;
Feb 18 13:02:11 hog6 postgres[94548]: [130034-2] LOCATION: pg_parse_query,
postgres.c:464
Feb 18 13:02:11 hog6 postgres[94548]: [130034-3] STATEMENT: select * from
imp_product offset 129833 limit 1;
Feb 18 13:02:34 hog6 postgres[94330]: [1-1] LOG: 00000: server process (PID
94548) was terminated by signal 11
Feb 18 13:02:34 hog6 postgres[94330]: [1-2] LOCATION: LogChildExit,
postmaster.c:2087
Feb 18 13:02:34 hog6 postgres[94330]: [2-1] LOG: 00000: terminating any other
active server processes
Feb 18 13:02:34 hog6 postgres[94330]: [2-2] LOCATION: CleanupProc,
postmaster.c:2008
Feb 18 13:02:34 hog6 postgres[94330]: [3-1] LOG: 00000: all server processes
terminated; reinitializing

============================
Running truss shows the file it's reading when it crashes.
("/dbdata/pgsql/data/base/17142/17234")
Is there anything to be infered from the mmap or munmap?
============================

2431: open("/dbdata/pgsql/data/base/17142/17234",0x2,0600) = 35 (0x23)
2431: lseek(35,0x0,0x343332) = 144023552 (0x895a000)
2431: open("/dbdata/pgsql/data/base/17142/16652",0x2,0600) = 36 (0x24)
2431: read(0x24,0x801e83680,0x2000) = 8192 (0x2000)
2431: read(0x24,0x801e85680,0x2000) = 8192 (0x2000)
2431: break(0x8b0000) = 0 (0x0)
2431: open("/dbdata/pgsql/data/base/17142/1247",0x2,0600) = 37 (0x25)
2431: read(0x25,0x801e87680,0x2000) = 8192 (0x2000)
2431: read(0x25,0x801e89680,0x2000) = 8192 (0x2000)
2431: lseek(35,0x0,0x3734) = 0 (0x0)
2431: read(0x23,0x801e8b680,0x2000) = 8192 (0x2000)
2431: read(0x23,0x801e8d680,0x2000) = 8192 (0x2000)
2431: read(0x23,0x801e8f680,0x2000) = 8192 (0x2000)
...
2431: read(0x23,0x801c87680,0x2000) = 8192 (0x2000)
2431: read(0x23,0x801c89680,0x2000) = 8192 (0x2000)
2431: break(0x20cd5000) = 0 (0x0)
2431: mmap(0x0,1064960,(0x3)PROT_READ|PROT_WRITE,(0x1002)MAP_ANON|
MAP_PRIVATE,4294967295,0x0) = 34394505216 (0x802128000)
2431: munmap(0x80075b000,0x1000) = 0 (0x0)
SIGNAL 11 (SIGSEGV)
SIGNAL 11 (SIGSEGV)
Process stopped because of: 16
process exit, rval = 139
SIGNAL 20 (SIGCHLD)
2391: select(6,{3 4 5},0x0,0x0,{120 0}) ERR#4 'Interrupted system
call'

=======================
The system tables confirm the file above is related to the data we've
isolated. (/dbdata/pgsql/data/base/17142/17234, imp_product, 129833)
=======================
catfang_de_broken=# select relfilenode from pg_class where
relname='imp_product';
relfilenode
-------------
17234
catfang_de_broken=# select ctid from imp_product offset 129832 limit 1;
ctid
-----------
(4743,60)
(1 row)

catfang_de_broken=# select ctid from imp_product offset 129833 limit 1;
ctid
-----------
(4743,61)
(1 row)

=================================
Copying the file to a new inode does not change anything. So am I right to
conclude table corruption and start developing a deep mistrust of this
relatively new server? Or do I need to look more widely? I tried to look at
the block indicated above using pg_dump, but I don't know enough to see
anything obviously wrong. I'm hoping one of you might.
=================================
./pg_filedump -R 4743 -i -f /dbdata/pgsql/data/base/17142/17234
Block 4743 ********************************************************
<Header> -----
Block Offset: 0x0250e000 Offsets: Lower 320 (0x0140)
Block: Size 8192 Version 1 Upper 344 (0x0158)
LSN: logid 19 recoff 0xca5ad618 Special 8192 (0x2000)
Items: 75 Free Space: 24
Length (including item array): 324

0000: 13000000 18d65aca 0f000000 40015801 (dot)(dot)(dot)(dot)(dot)(dot)Z(dot)(dot)(dot)(dot)(dot)(at)(dot)X(dot)
0010: 00200120 b09f9800 d89ea801 70826001 . . ........p.`.
0020: 589e0001 089e9800 709d3001 289d8800 X.......p.0.(...
0030: b89cd800 689ca000 109ca800 d09b8000 ....h...........
0040: 689bd000 089bc000 989ad800 509a8800 h...........P...
0050: e899c800 9899a000 3899c000 c098f000 ........8.......
0060: 6898a800 8897b801 2097d000 b096d800 h....... .......
0070: 4896c800 c0816001 e095c800 8895a800 H.....`.........
0080: 1095f000 b094b800 4894d000 d893e000 ........H.......
0090: 6893d800 0893b800 a092c800 3892c800 h...........8...
00a0: d891c000 8891a000 1091f000 b090c000 ................
00b0: 4890d000 e08fc800 688fe800 008fd000 H.......h.......
00c0: a08ec000 388ec800 d08dc800 608dd800 ....8.......`...
00d0: 008dc000 a08cb800 388cd000 d08bc800 ........8.......
00e0: 708bc000 188bb000 d88a8000 908a8800 p...............
00f0: 508a8000 088a8800 c8898000 88898000 P...............
0100: 48898000 00898800 b8888800 78888000 H...........x...
0110: 0888d800 9887d800 00873001 50866001 ..........0.P.`.
0120: a8855001 3885e000 5881c800 b8840001 ..P.8...X.......
0130: 5084c800 d0830001 7883a800 2083b000 P.......x... ...
0140: 00000000 ....

<Data> ------
Item 1 -- Length: 76 Offset: 8112 (0x1fb0) Flags: USED
XID: min (733) CMIN|XMAX: 14342668 CMAX|XVAC: 1
Block Id: 14787 linp Index: 21 Attributes: 7 Size: 32
infomask: 0x0513 (HASNULL|HASVARWIDTH|HASOID|XMIN_COMMITTED|XMAX_COMMITTED)
t_bits: [0]: 0x6f

1fb0: dd020000 0cdada00 01000000 0000c339 ...............9
1fc0: 15000700 1305206f 00000000 10193d06 ...... o......=.
1fd0: 06740600 8e2e0000 16000000 41736d61 .t..........Asma
1fe0: 72202d20 59616972 2044616c 616c0000 r - Yair Dalal..
1ff0: 9b060000 01000000 00000000 ............

...

Item 60 -- Length: 64 Offset: 2376 (0x0948) Flags: USED
XID: min (733) CMIN|XMAX: 0 CMAX|XVAC: 0
Block Id: 4743 linp Index: 60 Attributes: 7 Size: 32
infomask: 0x0913 (HASNULL|HASVARWIDTH|HASOID|XMIN_COMMITTED|XMAX_INVALID)
t_bits: [0]: 0x6f

0948: dd020000 00000000 00000000 00008712 ................
0958: 3c000700 1309206f 00000000 4b193d06 <..... o....K.=.
0968: 219e0b00 dc2e0000 0c000000 4b474e20 !...........KGN
0978: 37303630 45010000 01000000 00000000 7060E...........

Item 61 -- Length: 68 Offset: 2304 (0x0900) Flags: USED
XID: min (733) CMIN|XMAX: 0 CMAX|XVAC: 0
Block Id: 4743 linp Index: 61 Attributes: 7 Size: 32
infomask: 0x0913 (HASNULL|HASVARWIDTH|HASOID|XMIN_COMMITTED|XMAX_INVALID)
t_bits: [0]: 0x6f

0900: dd020000 00000000 00000000 00008712 ................
0910: 3d000700 1309206f 00000000 4c193d06 =..... o....L.=.
0920: f29c0b00 dc2e0000 0d000041 46414220 ...........AFAB
0930: 33325853 33000000 52050000 01000000 32XS3...R.......
0940: 00000000 ....

--
Clemens Wehrmann

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Dieter Schröder 2005-02-18 17:17:40 Re: PostgreSQL Replication
Previous Message Tom Lane 2005-02-18 17:13:08 Re: Database locks