From: | ITAGAKI Takahiro <itagaki(dot)takahiro(at)oss(dot)ntt(dot)co(dot)jp> |
---|---|
To: | pgsql-hackers(at)postgresql(dot)org |
Subject: | PANIC caused by open_sync on Linux |
Date: | 2007-10-26 04:21:56 |
Message-ID: | 20071026123608.29AD.ITAGAKI.TAKAHIRO@oss.ntt.co.jp |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
I encountered PANICs on CentOS 5.0 when I ran write-mostly workload.
It occurs only if wal_sync_method is set to open_sync; there were
no problem in fdatasync. It occurred on both Postgres 8.2.5 and 8.3dev.
PANIC: could not write to log file 0, segment 212 at offset 3399680,
length 737280: Input/output error
STATEMENT: COMMIT;
My nearby Linux guy says mixed usage of bufferd I/O and direct I/O
could cause errors (EIO) on many version of Linux kernels. If we use
buffered I/O before direct I/O, Linux could fail to discard kernel buffer
cache of the region and report EIO -- yes, it's a bug in Linux.
We use bufferd I/O on WAL segements even if wal_sync_method is open_sync.
We initialized segements with zero using buffered I/O, and after that,
we re-open them with specified sync options.
The behaviors in the bug are different on RHEL 4 and 5.
RHEL 4 -> No error reports even though the kernel cache is incosistenet.
RHEL 5 -> write() failes with EIO (Input/output error)
PANIC occurs only on RHEL 5, but RHEL 4 also has a problem. If a wal archiver
reads the inconsistent cache of wal segments, it could archive wrong contents
and PITR might fail at the corrupted archived file.
I'll recommend not to use open_sync for users on Linux until the bug is
fiexed. However, are there any idea to avoid the bug and to use direct i/o?
Mixed usage of bufferd and direct i/o is legal, but enforces complexity
to kernels. If we simplify it, things would be more relaxed. For example,
dropping zero-filling and only use direct i/o. Is it possible?
Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
From | Date | Subject | |
---|---|---|---|
Next Message | Greg Smith | 2007-10-26 05:19:35 | Re: PANIC caused by open_sync on Linux |
Previous Message | Tom Lane | 2007-10-26 03:26:08 | Re: [HACKERS] 8.3beta1 testing on Solaris |