From: | Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com> |
---|---|
To: | Craig Ringer <craig(at)2ndquadrant(dot)com> |
Cc: | Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS |
Date: | 2018-04-04 21:28:09 |
Message-ID: | CAEepm=3Ei3-oGh5i-aUrHk6C=3F5F3hG8LBy2_37wDrPTZ2M_Q@mail.gmail.com |
Lists: | pgsql-hackers |
On Thu, Apr 5, 2018 at 2:00 AM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> I've tried xfs, jfs, ext3, ext4, even vfat. All behave the same on EIO.
> Didn't try zfs-on-linux or other platforms yet.
I think ZFS will be an outlier here, at least in a pure
write()/fsync() test. (1) It doesn't even use the OS page cache,
except when you mmap()*. (2) Its idea of syncing data is to journal
it, and its journal presumably isn't in the OS page cache. In other
words it doesn't use Linux's usual write-back code paths.
While contemplating what exactly it would do (not sure), I came across
an interesting old thread on the freebsd-current mailing list that
discusses UFS, ZFS and the meaning of POSIX fsync(). Here we see a
report of FreeBSD + UFS doing exactly what the code suggests:
https://lists.freebsd.org/pipermail/freebsd-current/2007-August/076578.html
That is, it keeps the pages dirty so it tells the truth later.
Apparently like Solaris/Illumos (based on drive-by code inspection,
see explicit treatment of retrying, though I'm not entirely sure if
the retry flag is set just for async write-back), and apparently
unlike every other kernel I've tried to grok so far (things descended
from ancestral BSD but not descended from FreeBSD, with macOS/Darwin
apparently in the first category for this purpose).
Here's a new ticket in the NetBSD bug database for this stuff:
As mentioned in that ticket and by Andres earlier in this thread,
keeping the page dirty isn't the only strategy that would work, and it
may be problematic in different ways (it tells the truth but floods
your cache with unflushable stuff until you force-unmount the file
system and your buffers are finally invalidated after ENXIO errors? I
don't know). I have no qualified opinion on that. I just know that we
need a way for fsync() to tell the truth about all preceding writes or
our checkpoints are busted.
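
To make the requirement concrete, here is a minimal sketch of the kind of pure write()/fsync() sequence under discussion. The helper name write_then_fsync() is hypothetical and this is not PostgreSQL code; it only illustrates that the caller sees a single error report from fsync(), so whether a retry is meaningful depends on whether the kernel kept the failed pages dirty.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not PostgreSQL code): write a buffer, then fsync it.
 * Returns 0 on success, or the errno value of the first failure.
 */
static int
write_then_fsync(int fd, const char *buf, size_t len)
{
    size_t      done = 0;

    while (done < len)
    {
        ssize_t     n = write(fd, buf + done, len - done);

        if (n < 0)
        {
            if (errno == EINTR)
                continue;       /* interrupted before writing; try again */
            return errno;
        }
        done += (size_t) n;
    }

    /*
     * A failure here may be the only report the caller ever gets.  On a
     * kernel that marks pages clean after a failed write-back, a second
     * fsync() can return 0 even though the data never reached stable
     * storage, so the caller should not treat a retry as proof of success.
     */
    if (fsync(fd) != 0)
        return errno;

    return 0;
}
```

A caller that gets a nonzero result would have to assume the earlier writes are lost unless it knows the kernel keeps failed pages dirty.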
*We mmap() + msync() in pg_flush_data() if you don't have
sync_file_range(), and I see now that that is probably not a great
idea on ZFS because you'll finish up double-buffering (or is that
triple-buffering?), flooding your page cache with transient data.
Oops. That is off-topic and not relevant to the checkpoint
correctness topic of this thread though, since pg_flush_data() is
advisory only.
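
For readers unfamiliar with that fallback, the following is a rough sketch of an mmap() + msync() flush hint. It is only loosely modeled on the idea described above, not copied from pg_flush_data(); the helper name flush_hint_via_msync() and its return convention are assumptions made for illustration. The mapping itself is where the extra buffering on ZFS would come from.

```c
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: hint that the kernel should start writing back a
 * range of a file by mapping it and calling msync(MS_ASYNC).  The result
 * is advisory only; it says nothing about durability.  The offset must be
 * a multiple of the page size.  Returns 0 on success, -1 on failure.
 */
static int
flush_hint_via_msync(int fd, off_t offset, size_t nbytes)
{
    void       *p;
    int         rc;

    if (nbytes == 0)
        return 0;               /* nothing to flush */

    /* Map the range read-only and shared so it refers to the file's pages. */
    p = mmap(NULL, nbytes, PROT_READ, MAP_SHARED, fd, offset);
    if (p == MAP_FAILED)
        return -1;

    /* Ask for write-back to be scheduled without waiting for it. */
    rc = msync(p, nbytes, MS_ASYNC);

    (void) munmap(p, nbytes);
    return rc;
}
```

Because the hint is advisory, a caller would ignore a failure here and still rely on a later fsync() for correctness.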
--
Thomas Munro
http://www.enterprisedb.com