From: | Chris Travers <chris(at)travelamericas(dot)com> |
---|---|
To: | Ian Westmacott <ianw(at)intellivid(dot)com>, pgsql-admin(at)postgresql(dot)org |
Subject: | Re: database corruption |
Date: | 2005-04-16 01:29:13 |
Message-ID: | 42606A69.9010102@travelamericas.com |
Lists: | pgsql-admin |
Hi Ian;
I think it is important to figure out why this is happening. I would
not want to run any production databases on systems that were failing
like this.
I am trying to figure out what the likely causes of the errors are:
1) Do any other computers in your building suffer random application
crashes, power-downs, etc.?
2) I take it there are no RAID controllers involved?
3) Is the RAM non-ECC?
4) Are the systems on UPSes?
If I could make a wild (and probably wrong) guess, I would wonder if
something external to the system (like electrical supply) was
introducing glitches into memory, causing bad data to be written. I am
only mentioning it because I have implicated electrical supply in other
cases where rare computer failures were affecting many systems...
Ian Westmacott wrote:
>For several weeks now we have been experiencing fairly
>severe database corruption upon clean reboot. It is very
>repeatable, and the corruption is of the following forms:
>
>ERROR: could not access status of transaction foo
>DETAIL: could not open file "bar": No such file or directory
>
>ERROR: invalid page header in block foo of relation "bar"
>
>ERROR: uninitialized page in block foo of relation "bar"
>
>
>At first, we believed this was related to XFS, and have
>been pursuing investigations along those lines. However,
>we have now experienced the exact same problem with JFS.
>
>Here are some details:
>
>- Postgres 7.4.2
>- 2.6.6 kernel.org kernel
>- dedicated database partition
>- repeatable with XFS and JFS (have not seen on ext3)
>- repeatable with and without Linux software RAID 0
>- repeatable with IDE and SATA
>- repeatable with and without fsync, and with fdatasync
>- repeatable on multiple systems
>
>
>I have two questions:
>
>- any known reason why this might be occurring? (we must
> have something wrong, for this high rate of severe
> error).
>
>- if I don't care about losing data, and am not interested
> in trying to recover anything, how can I arrange for
> Postgres to proceed normally? I know about
> zero_damaged_pages, but this doesn't help with missing
> transaction files and such. Is there any way to get
> Postgres to chuck anything bad and proceed?
>
>Thanks,
>
> --Ian