From: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
To: Benoit Lobréau <benoit(dot)lobreau(at)gmail(dot)com>, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Index corruption & broken clog
Date: 2017-10-13 07:55:31
Message-ID: 1507881331.2770.1.camel@cybertec.at
Lists: pgsql-general
Benoit Lobréau wrote:
> One of my PostgreSQL servers crashed badly yesterday. A process was killed (see dmesg below) and postgres was stuck with these processes:
>
> postgres 2083 1 0 Oct08 ? 00:19:02 /usr/lib/postgresql/9.5/bin/postgres -D /home/postgres/data/issss090/systeme
> postgres 2221 2083 0 Oct08 ? 00:02:03 postgres: issss090: logger process
> postgres 14068 2083 8 Oct09 ? 02:50:22 [postgres]
> postgres 19668 19665 0 09:40 pts/0 00:00:00 -su
>
> I couldn't stop it with "pg_ctl stop immediate". We had to kill them and release the shared memory block.
>
> dmesg showed this.
>
> [Tue Oct 10 07:45:29 2017] postgres[25506]: segfault at 7f2253ecb000 ip 00007f225aead994 sp 00007ffc2a9c0c28 error 4 in libc-2.23.so[7f225ae0e000+1c0000]
> [Tue Oct 10 07:45:30 2017] BUG: Bad rss-counter state mm:ffff8800516a7c00 idx:0 val:178
> [Tue Oct 10 07:45:31 2017] BUG: Bad page map in process postgres pte:4800000090248c8d pmd:2c3de067
>
> When I checked the PostgreSQL log, I found lots of these:
>
> could not read block 76638 in file "pg_tblspc/16395/PG_9.5_201510051/16396/20082": read only 0 of 8192 bytes
> could not read block 76669 in file "pg_tblspc/16395/PG_9.5_201510051/16396/19993": read only 0 of 8192 bytes
> index "degrade_pkey" contains corrupted page at block 60392
> index "degrade_pkey" contains unexpected zero page at block 60392
> index "idx_coor_brute_geometrie" contains corrupted page at block 53061
> index "idx_coor_brute_geometrie" contains corrupted page at block 56109
> index "idx_coor_brute_geometrie" contains corrupted page at block 58867
> index "idx_coor_brute_geometrie" contains corrupted page at block 59003
> index "idx_coor_brute_geometrie" contains corrupted page at block 60138
> index "idx_coor_brute_geometrie" contains corrupted page at block 71974
> index "idx_coor_brute_geometrie" contains corrupted page at block 80968
> index "idx_coor_brute_geometrie" contains corrupted page at block 86777
> index "idx_coor_brute_geometrie" contains unexpected zero page at block 78586
> index "idx_coor_proj_ligne_geometrie" contains corrupted page at block 64349
> index "idx_numero_course" contains corrupted page at block 3435
> index "idx_numero_course" contains corrupted page at block 7176
> index "idx_numero_engin" contains corrupted page at block 30996
> index "idx_utc_envoi_composant_amont" contains corrupted page at block 76497
> index "idx_utc_envoi_reseau" contains corrupted page at block 76524
> right sibling's left-link doesn't match: block 102923 links to 49947 instead of expected 1 in index "idx_utc_reception_fournisseur"
> right sibling's left-link doesn't match: block 103014 links to 51087 instead of expected 1 in index "idx_utc_reception_composant_amont"
> right sibling's left-link doesn't match: block 76978 links to 59148 instead of expected 1 in index "idx_utc_envoi_reseau"
> right sibling's left-link doesn't match: block 77073 links to 82204 instead of expected 1 in index "idx_utc_envoi_composant_amont"
>
> The errors started to show up on inserts 8 hours before the crash.
> I don't understand, because when I look into these pages with pageinspect they are not empty / don't look bad.
> For example, "SELECT * FROM bt_page_items('degrade_pkey', 60392);" showed a normal page, but the error message claims it is zeroed.
>
> I checked the tables for duplicates on the primary key. Checksums are enabled, and I didn't get any error message when I scanned the tables.
> I found 77 duplicates on one table and this:
>
> ERROR: could not access status of transaction 3443523584
> DETAIL: Could not open file "pg_clog/0CD4": No such file or directory.
>
> I don't think this transaction ever existed, given the files I found in the clog directory.
>
> postgres(at)ulbdgomp01:~/data/igomp090/systeme/pg_clog$ ls -al
> total 30292
> drwx------ 2 postgres postgres 4096 Oct 9 22:50 .
> drwx------ 18 postgres postgres 4096 Oct 12 02:32 ..
> -rw------- 1 postgres postgres 262144 Aug 16 22:50 0000
> -rw------- 1 postgres postgres 262144 Aug 17 16:20 0001
> -rw------- 1 postgres postgres 262144 Aug 17 22:25 0002
> ...
> -rw------- 1 postgres postgres 262144 Oct 7 19:31 0072
> -rw------- 1 postgres postgres 262144 Oct 9 22:50 0073
> -rw------- 1 postgres postgres 122880 Oct 12 11:56 0074
>
> In your opinion:
> * Am I missing a (lot of) clog file(s), or is there an error in the heap page giving an erroneous txid?
> * Is there a way to find which row has the transaction number 3443523584 (in t_xmin / t_xmax, I suppose)?
> * Why am I not seeing any obviously broken page when I check with pageinspect?
>
> We are using:
> * Ubuntu 16.04 LTS
> * PostgreSQL 9.5.8 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609, 64-bit
PostgreSQL did not complain about an index being zeroed; it just
says that it cannot find a certain page (which probably never existed).
The missing CLOG entries are probably just an effect of the data
corruption: a tuple contains a transaction number that never existed.
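As for finding the row that carries that transaction number: you can
scan the table and compare "xmin" and "xmax". A minimal sketch, assuming
the affected table is called "degrade" (just a guess from your index
names):

    -- Sketch only: "degrade" is a guessed table name.
    -- Comparing as text avoids xid/integer operator problems, since
    -- 3443523584 does not fit into a signed 32-bit integer.
    -- Caveat: the scan itself can abort with the same CLOG error once
    -- it reads the broken tuple; if that happens, narrow the search
    -- down with ctid range conditions instead.
    SELECT ctid, xmin, xmax
    FROM degrade
    WHERE xmin::text = '3443523584'
       OR xmax::text = '3443523584';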
I would recover the database, on new hardware, to a point in time
before the data corruption set in.
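If you have a base backup and a WAL archive, a "recovery.conf" along
these lines would do it (the archive path and the target time below are
placeholders, not values I know):

    # Sketch of point-in-time recovery settings for 9.5; adjust the
    # archive path and pick a target time before the corruption began.
    restore_command = 'cp /path/to/wal_archive/%f "%p"'
    recovery_target_time = '2017-10-09 23:00:00'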
Shut down the corrupted database and make a physical backup of it.
Check the disks and memory on the machine where the problem occurred.
You can try to salvage data from the corrupted tables, either
using "zero_damaged_pages" (but that will only work if the page
header is corrupted), or by excluding problematic tuples by their
"ctid".
Yours,
Laurenz Albe