| From: | lec <limec(at)streamyx(dot)com> | 
|---|---|
| To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> | 
| Cc: | Marco Colombo <marco(at)esi(dot)it>, pgsql-general <pgsql-general(at)postgresql(dot)org> | 
| Subject: | Re: Losing records when server hang | 
| Date: | 2004-08-10 01:36:03 | 
| Message-ID: | 41182683.6090409@streamyx.com | 
| Lists: | pgsql-general | 
Tom Lane wrote:
>Marco Colombo <marco(at)esi(dot)it> writes:
>>Tom Lane wrote:
>>>However this would seem to imply disk drive misfeasance above and beyond
>>>your motherboard problem.
>
>>Well, no. How about this theory:
>
>>1) everything is ok:
>>    the backend executes write()/fsync() for transactions 1-5
>
>>2) hardware fails somehow at MB level (imagine CPU/RAM overheating):
>>    RAM gets corrupted - kernel starts oopsing (but goes on)
>>    meanwhile, the backend executes write()/fsync() for transactions 6-10,
>>    but randomly corrupted data gets written to disk.
>
>>3) unrecoverable kernel error occurs, the show stops.
>
>>On recovery, transactions 6-9 don't even look like valid log entries, while
>>10, for some reason, does (maybe only data is corrupted).
>
>>I'm not familiar with the details of WAL files and post-crash recovery,
>>but is that possible? Or does the process stop at the first failure?
>
>Recovery will stop at the first corrupted record, so it would not happen
>like that.  But you are right, the MB failure alone might have been
>enough to corrupt the outgoing WAL log data and thus produce the
>scenario I described.  Once Postgres *thinks* transactions 1-10 are
>safely down to disk in the WAL log, it will feel free to update the data
>files in any random order that seems convenient.  So the write of record
>10 could have occurred before the rest, and if that happened not to get
>corrupted by the MB problem, we could see the result lec describes.
>
>Of course this is all guesswork since we have no direct evidence to look
>at, but it seems fairly plausible.
>
>>Anyway, if your CPU/RAM is failing, no DB technology can save you.
>
>Agreed.  Software certainly cannot make any guarantees if it can't even
>execute correctly ...
>
Same here. I would rather not have to prove anything when the hardware 
isn't reliable, but "management" is asking about the lost transactions 
and blaming the system, the software, and the database. I could show them 
that the lost transactions were due to the system hang, but transaction 
#10 being there casts doubt on my reasoning.
Thanks for all your feedback and reasoning.
--lec
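
For readers following the thread, here is a toy sketch of the point Tom makes above: log replay stops at the first record that fails validation, so a later intact record is never reached. This is not PostgreSQL's actual WAL format or recovery code. The record layout, `make_record`, and `replay` are hypothetical names invented for the illustration; the only assumption taken from the thread is that recovery halts at the first corrupted record.

```python
import zlib
from typing import List, Tuple

# Hypothetical record layout: (transaction id, payload, stored CRC32).
Record = Tuple[int, bytes, int]


def make_record(txid: int, payload: bytes) -> Record:
    """Build a log record whose checksum matches its payload."""
    return (txid, payload, zlib.crc32(payload))


def is_valid(record: Record) -> bool:
    """A record is valid when its payload still matches the stored checksum."""
    _, payload, crc = record
    return zlib.crc32(payload) == crc


def replay(log: List[Record]) -> List[int]:
    """Return the ids replayed before the first invalid record."""
    recovered = []
    for record in log:
        if not is_valid(record):
            break  # the first corrupted record ends replay
        recovered.append(record[0])
    return recovered


# Ten transactions are logged in order.
log = [make_record(i, f"tx {i}".encode()) for i in range(1, 11)]

# Simulate the failing hardware: flip one bit in the payloads of records 6-9.
for idx in range(5, 9):
    txid, payload, crc = log[idx]
    damaged = bytes([payload[0] ^ 0x01]) + payload[1:]
    log[idx] = (txid, damaged, crc)

print(replay(log))        # ids replayed from the log
print(is_valid(log[9]))   # record 10 is intact, yet replay never reaches it
```

In this model, transaction 10 cannot come back through log replay. That matches Tom's guess that it survived because its data-file write happened to reach disk uncorrupted before the hang, not because recovery skipped over the damaged records.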