Re: Losing records when server hang

From: lec <limec(at)streamyx(dot)com>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Marco Colombo <marco(at)esi(dot)it>, pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Losing records when server hang
Date: 2004-08-10 01:36:03
Message-ID: 41182683.6090409@streamyx.com
Lists: pgsql-general

Tom Lane wrote:

>Marco Colombo <marco(at)esi(dot)it> writes:
>
>>Tom Lane wrote:
>>
>>>However this would seem to imply disk drive misfeasance above and beyond
>>>your motherboard problem.
>>
>>Well, no. How about this theory:
>>
>>1) everything is ok:
>> the backend executes write()/fsync() for transactions 1-5
>>
>>2) hardware fails somehow at MB level (imagine CPU/RAM overheating):
>> RAM gets corrupted - kernel starts oopsing (but goes on)
>> meanwhile, the backend executes write()/fsync() for transactions 6-10,
>> but randomly corrupted data gets written to disk.
>>
>>3) unrecoverable kernel error occurs, the show stops.
>>
>>On recovery, transactions 6-9 don't even look like valid log entries, while
>>10, for some reason, does (maybe only data is corrupted).
>>
>>I'm not familiar with the details of WAL files and post-crash recovery,
>>but is that possible? Or does the process stop at the first failure?
>
>Recovery will stop at the first corrupted record, so it would not happen
>like that. But you are right, the MB failure alone might have been
>enough to corrupt the outgoing WAL log data and thus produce the
>scenario I described. Once Postgres *thinks* transactions 1-10 are
>safely down to disk in the WAL log, it will feel free to update the data
>files in any random order that seems convenient. So the write of record
>10 could have occurred before the rest, and if that happened not to get
>corrupted by the MB problem, we could see the result lec describes.
>
>Of course this is all guesswork since we have no direct evidence to look
>at, but it seems fairly plausible.
>
>>Anyway, if your CPU/RAM is failing, no DB technology can save you.
>
>Agreed. Software certainly cannot make any guarantees if it can't even
>execute correctly ...

Same here. I would rather not have to prove anything when the hardware
itself isn't reliable, but "management" keeps asking about the lost
transactions and blaming the system, the software, or the database. I
could show them that the transactions were lost because of the system
hang, but transaction #10 being there undermines my argument.

Thanks for all your feedback and reasoning.

--lec
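
Tom's remark that recovery stops at the first corrupted record can be
illustrated with a toy write-ahead log. The sketch below is only an
assumption-laden model: the record layout (length plus CRC-32 header),
the helper names append_record and replay, and the "txn N" payloads are
all made up for illustration and are not PostgreSQL's actual WAL format
or recovery code.

```python
import struct
import zlib

# Toy record header: payload length and CRC-32 of the payload.
# This layout is hypothetical, not PostgreSQL's real WAL record format.
HEADER = struct.Struct("<II")


def append_record(log: bytearray, payload: bytes) -> None:
    """Append one checksummed record to the in-memory log."""
    log += HEADER.pack(len(payload), zlib.crc32(payload))
    log += payload


def replay(log: bytes) -> list:
    """Return payloads in order, stopping at the first invalid record."""
    applied = []
    offset = 0
    while offset + HEADER.size <= len(log):
        length, crc = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        payload = log[start:start + length]
        if len(payload) != length or zlib.crc32(payload) != crc:
            break  # first corrupted record: nothing after it is replayed
        applied.append(bytes(payload))
        offset = start + length
    return applied


# Write transactions 1-10 to the log.
log = bytearray()
for n in range(1, 11):
    append_record(log, f"txn {n}".encode())

# Simulate the failing hardware: damage the first payload byte of record 6.
# Records 1-9 all have the same size because their payloads are 5 bytes.
record_size = HEADER.size + len(b"txn 1")
log[5 * record_size + HEADER.size] ^= 0xFF

recovered = replay(bytes(log))
print(recovered)
```

In this model record 10 is still intact in the log, yet replay never
reaches it because record 6 fails its checksum first. That is consistent
with Tom's explanation: if transaction 10 survived, it would have to be
through a data-file write that happened before the crash, not through
log replay.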
