From: | Andrew Biagioni <andrew(dot)biagioni(at)e-greek(dot)net> |
---|---|
To: | "scott(dot)marlowe" <scott(dot)marlowe(at)ihs(dot)com> |
Cc: | pgsql-admin(at)postgresql(dot)org |
Subject: | Re: Spontaneous PostgreSQL Server Reboot? |
Date: | 2004-04-01 04:19:26 |
Message-ID: | 406B984E.7030109@e-greek.net |
Lists: | pgsql-admin |
scott.marlowe wrote:
> On Tue, 30 Mar 2004, Andrew Biagioni wrote:
>
>
>>Alex,
>>
>>the answer is "no" to all of these. We are a tiny start-up (2 guys, and
>>we do our own cleaning); ambient temperature varies significantly but
>>is not related to the failure, and one machine starts beeping when it
>>gets too hot (then we added an extra case fan); no fancy watchdogs
>>(maybe someday... One can only dream :-> ); three different cases,
>>power supplies, motherboards, etc., etc. (one power supply is
>>extra-large, and that's the machine that started failing first!).
>>
>>We originally blamed the problem on hardware failure (first machine);
>>then on OS version/configuration (second machine); now we're out of
>>things to blame, except maybe unusually bad luck...
>
>
> What did memtest86 say?
>
> Did the same person build all the machines? I've seen plenty of folks
> build machines and zap the memory when installing it. >95% of all ESD
> failures are partial / delayed failures, so just because a computer boots
> up doesn't mean proper ESD procedures were followed, and if not, and if
> you're in a dry environment like I am (I live in Denver) then it's quite
> possible all three have bad CPU/mobo/memory or something like that.
Two different people built the machines; we're both electrical
engineers with plenty of experience with static issues, so ESD damage
is not likely.

As for memtest86 - I haven't been able to run it on two of the machines
yet (they are in production), and I have to restart the third one (it
was "retired" after the third time it died on us).
Meanwhile I found out some more details:
- the first machine had a software RAID setup that may have been unreliable
- the second machine had a much older kernel and sloppily updated
modules, and it would hang -- not reboot
- the last machine's reboot MAY have been caused by a line power issue
(the whole building lost power a few hours later, so I lost some info on
other machines' restarting -- I'll dig more).
So -- it's memtest86 and badblocks for all three (as soon as I can),
better UPS-ing, updated kernel(s), and checking more machines' logs;
then we'll see...
Thanks to you all for the suggestions -- keep them coming!
Andrew