| From: | Chris Travers <chris(at)travelamericas(dot)com> | 
|---|---|
| To: | Ian Westmacott <ianw(at)intellivid(dot)com>, pgsql-admin(at)postgresql(dot)org | 
| Subject: | Re: database corruption | 
| Date: | 2005-04-16 01:29:13 | 
| Message-ID: | 42606A69.9010102@travelamericas.com | 
| Lists: | pgsql-admin | 
Hi Ian;
I think it is important to figure out why this is happening.  I would 
not want to run any production databases on systems that were failing 
like this.
I am trying to figure out what the likely causes of the errors are...
1)  Do any other computers in your building suffer random application 
crashes, power-downs, etc.?
2)  I take it there are no RAID controllers involved?
3)  Is the RAM non-ECC?
4)  Are the systems on UPSes?
If I could make a wild (and probably wrong) guess, I would wonder if 
something external to the system (like electrical supply) was 
introducing glitches into memory, causing bad data to be written.  I am 
only mentioning it because I have implicated electrical supply in other 
cases where rare computer failures were affecting many systems...
Ian Westmacott wrote:
>For several weeks now we have been experiencing fairly
>severe database corruption upon clean reboot.  It is very
>repeatable, and the corruption is of the following forms:
>
>ERROR:  could not access status of transaction foo
>DETAIL:  could not open file "bar": No such file or directory
>
>ERROR:  invalid page header in block foo of relation "bar"
>
>ERROR:  uninitialized page in block foo of relation "bar"
>
>
>At first, we believed this was related to XFS, and have
>been pursuing investigations along those lines.  However,
>we have now experienced the exact same problem with JFS.
>
>Here are some details:
>
>- Postgres 7.4.2
>- 2.6.6 kernel.org kernel
>- dedicated database partition
>- repeatable with XFS and JFS (have not seen on ext3)
>- repeatable with and without Linux software RAID 0
>- repeatable with IDE and SATA
>- repeatable with and without fsync, and with fdatasync
>- repeatable on multiple systems
>
>
>I have two questions:
>
>- any known reason why this might be occurring?  (we must
>  have something wrong, for this high rate of severe
>  error).
>
>- if I don't care about losing data, and am not interested
>  in trying to recover anything, how can I arrange for
>  Postgres to proceed normally?  I know about
>  zero_damaged_pages, but this doesn't help with missing
>  transaction files and such.  Is there any way to get
>  Postgres to chuck anything bad and proceed?
>
>Thanks,
>
>	--Ian
>
>
>