index and table corruption

From: "Anand Kumar, Karthik" <Karthik(dot)AnandKumar(at)classmates(dot)com>
To: "pgsql-general(at)postgresql(dot)org" <pgsql-general(at)postgresql(dot)org>
Subject: index and table corruption
Date: 2013-12-19 18:42:30
Message-ID: CED87E13.C57E7%karthik.anandkumar@memorylane.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi,

We're looking for help with possible corruption of our indexes and tables.

Seemingly in the middle of normal operations, we will run into errors like
the below:

ERROR: index "mv_visits_p03_n2" contains unexpected zero page at block
15939
ERROR: invalid page header in block 344713 of relation
pg_tblspc/4376157/PG_9.1_201105231/16393/8367465

Following which the database continues on, but IO creeps up until finally
the server becomes unresponsive. The database has never 'crashed' though

A majority of the tables are the same each time, although new ones will
come in, and old ones will go out. A total of about 84 out of 452 tables
have gotten this error so far.

We run postgres verion 9.1.2, installed via the PGDG rpms.
The server runs centos5.6, and the disk backend is Netapp based SAN
Its a 24CPU box, with 768G RAM.
The database is about 1TB. Its a single database cluster.

Things we've tried so far:

- Everytime we run into the error, we restore the database from a previous
snapshot (block level Netapp snapshot). Snapshots are taken with the
postgres hot backup mode enabled, and are clean. They are block level, so
ideally going back to a snapshot should remove any block level corruption
that occurred on the device.

- We set zero_damaged_pages = on, ran a full vacuum and re-index of 4
tables. Both the full vacuum and reindex completed successfully, with no
errors. The same tables showed up when it failed again.

- We've had the sysadmins check for errors with the hardware ­ no errors
so far about any h/w problems, either on the box, with the SAN switches,
or on the filer. We are going to switch over to a different server on the
same SAN backend, to see if that helps

- We suspected it might have something to do with
http://wiki.postgresql.org/wiki/20120924updaterelease, and upgraded to
postgres 9.1.11, that hasn't helped.

- We had shared_buffers set to 60G, and reduced that down to 8G, and then
to 4G, suspecting problems with the background writer handling such high
shared buffers, that hasn't helped either.

Our postgres configuration is:

Version: PostgreSQL 9.1.2 on x86_64-unknown-linux-gnu, compiled by gcc
(GCC) 4.1.2 20080704 (Red Hat 4.1.2-50), 64-bit
Updgraded to: PostgreSQL 9.1.11 on x86_64-unknown-linux-gnu, compiled by
gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-50), 64-bit
name |

curent_setting
------------------------------+--------------------------------------------
-------------------------------------------------------
application_name | psql
archive_command | /usr/bin/archiver.sh %f %p
archive_mode | on
checkpoint_completion_target | 0.8
checkpoint_segments | 25
checkpoint_timeout | 10min
checkpoint_warning | 2min
client_encoding | UTF8
commit_siblings | 25
custom_variable_classes | symmetric
DateStyle | ISO, MDY
default_statistics_target | 300
default_text_search_config | pg_catalog.english
effective_cache_size | 128GB
lc_messages | en_US.UTF-8
lc_monetary | en_US.UTF-8
lc_numeric | en_US.UTF-8
lc_time | en_US.UTF-8
listen_addresses | *
log_checkpoints | on
log_destination | syslog
log_directory | /var/lib/pgsql/cmates/admin
log_filename | postgresql-%a.log
log_line_prefix | user=%u,db=%d,ip=%h
log_min_duration_statement | 0
log_rotation_age | 1d
log_rotation_size | 0
log_timezone | US/Pacific
log_truncate_on_rotation | on
logging_collector | off
maintenance_work_mem | 32MB
max_connections | 1500
max_locks_per_transaction | 1000
max_stack_depth | 2MB
max_wal_senders | 5
port | 5432
search_path | activities, alert, announce, askme, audit,
authentication, book, btdt, bulletinboard, cache, cas, cdc, cmates, cmdba,
collection, dep, emailsubscription, emailvalidation, eventmail, feeds,
friend, geo, inbox, invitation, ir
, kkumar, merge, myvisitor, people, photos, prepsports, profile,
provisioning, quiz, registrant_icons, registration, reunion, school,
schoolfeed, shortlist, socialauth, statspack, story, symmetricds, target,
yearbook, "$user", public
shared_buffers | 8GB
synchronous_commit | off
syslog_facility | local0
syslog_ident | postgres
TimeZone | US/Pacific
vacuum_freeze_table_age | 0
wal_buffers | 16MB
wal_level | archive
wal_sync_method | fsync
work_mem | 8MB
(47 rows)

Any help would be most appreciated!

Thanks,
Karthik

Thanks,
Karthik

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Shaun Thomas 2013-12-19 19:23:10 Re: index and table corruption
Previous Message jon@stylesage 2013-12-19 17:44:00 Re: After dump/restoring from 32bit 8.4-windows to 64bit 9.2.4-linux experiencing 10x slowdown on queries