From: | Craig Ringer <craig(at)2ndquadrant(dot)com> |
---|---|
To: | Christophe Pettus <xof(at)thebuild(dot)com> |
Cc: | Greg Stark <stark(at)mit(dot)edu>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS |
Date: | 2018-04-09 01:31:56 |
Message-ID: | CAMsr+YHQ3Evsdc5QMwZ6=rbg0Tdwgv+2DCfLnjV+4VJwNTocgQ@mail.gmail.com |
Lists: | pgsql-hackers |
On 9 April 2018 at 05:28, Christophe Pettus <xof(at)thebuild(dot)com> wrote:
>
> > On Apr 8, 2018, at 14:23, Greg Stark <stark(at)mit(dot)edu> wrote:
> >
> > They consider dirty filesystem buffers when there's
> > hardware failure preventing them from being written "a memory leak".
>
> That's not an irrational position. File system buffers are *not*
> dedicated memory for file system caching; they're being used for that
> because no one has a better use for them at that moment. If an inability
> to flush them to disk meant that they suddenly became pinned memory, a
> large copy operation to a yanked USB drive could result in the system
> having no more allocatable memory. I guess in theory that they could swap
> them, but swapping out a file system buffer in hopes that sometime in the
> future it could be properly written doesn't seem very architecturally sound
> to me.
>
Yep.

Another example is a write to an NFS or iSCSI volume that goes away
forever. What if the app keeps write()ing in the hopes it'll come back, and
by the time the kernel starts reporting EIO for write(), it's already
saddled with a huge volume of dirty writeback buffers it can't get rid of
because someone, one day, might want to know about them?

You could make the argument that it's OK to forget if the entire file
system goes away. But actually, why is that OK? What if it's remounted
again? That'd be really bad too, for someone expecting write reliability.

You can coarsen from dirty buffer tracking to marking the FD(s) bad, but
what if there's no FD to mark because the file isn't open at the moment?

You can mark the inode cache entry and pin it, I guess. But what if your
app triggered I/O errors over vast numbers of small files? Again, the
kernel's left holding the bag.

It doesn't know if/when an app will return to check. It doesn't know how
long to remember the failure for. It doesn't know when all interested
clients have been informed and it can treat the fault as cleared/repaired,
either, so it'd have to *keep on reporting EIO for PostgreSQL's own writes
and fsync()s indefinitely*, even once we do recovery.

The only way it could avoid that would be to keep the dirty writeback pages
around and flagged bad, then clear the flag when a new write() replaces the
same file range. I can't imagine that being practical.
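
To make the bookkeeping in that last idea concrete, here is a toy model in
Python. It is purely illustrative: `FailedRanges` and its methods are made-up
names, not a kernel or PostgreSQL interface, and a real implementation would
need this state for every file that ever had a failed writeback.

```python
import errno
from dataclasses import dataclass, field


@dataclass
class FailedRanges:
    """Toy model (hypothetical, not a real kernel interface).

    Tracks byte ranges of one file whose writeback failed and that
    nobody has overwritten since.
    """

    bad: list = field(default_factory=list)  # (start, end) pairs, end exclusive

    def writeback_failed(self, start, end):
        # The "kernel" could not write [start, end) to storage: flag it bad.
        self.bad.append((start, end))

    def write(self, start, end):
        # A new write() covering [start, end) replaces the lost data there,
        # so only the parts of each bad range outside it stay flagged.
        remaining = []
        for b_start, b_end in self.bad:
            if b_start < start:
                remaining.append((b_start, min(b_end, start)))
            if b_end > end:
                remaining.append((max(b_start, end), b_end))
        self.bad = remaining

    def fsync(self):
        # Keep failing for as long as any range is still flagged bad.
        if self.bad:
            raise OSError(errno.EIO, "unrepaired writeback failure")
```

Even in this simplified form, the flagged state only goes away once someone
rewrites every failed byte, and the kernel has no way to know whether anyone
ever will.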

Blaming the kernel for this sure is the easy way out.

But IMO we cannot rationally expect the kernel to remember error state
forever for us, then forget it when we expect, all without actually telling
it anything about our activities or even that we still exist and are still
interested in the files/writes. We've closed the files and gone away.

Whatever we do, it's likely going to have to involve not doing that anymore.
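
As a rough sketch of what "not doing that anymore" could look like, the Python
below holds every written file open until its fsync() has been checked. This
is an illustration only, not how PostgreSQL does or will do it: `PendingSyncs`
is a hypothetical helper, and it assumes the kernel reports a writeback error
to a descriptor that was open when the error occurred, which depends on the
platform and kernel version.

```python
import os


class PendingSyncs:
    """Hypothetical helper: keep each written file open until fsync() succeeds."""

    def __init__(self):
        self._fds = {}  # path -> file descriptor still awaiting fsync()

    def write(self, path, data):
        fd = self._fds.get(path)
        if fd is None:
            # Open once and remember the descriptor instead of closing it
            # straight after the write.
            fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
            self._fds[path] = fd
        view = memoryview(data)
        while view:  # os.write() may write fewer bytes than requested
            written = os.write(fd, view)
            view = view[written:]

    def sync_all(self):
        # fsync() every held descriptor. An OSError (for example EIO)
        # propagates to the caller; this sketch does not retry.
        for path in list(self._fds):
            fd = self._fds[path]
            os.fsync(fd)
            os.close(fd)  # close only after the sync has been checked
            del self._fds[path]
```

The obvious cost is descriptor pressure: with many files touched between
syncs, this holds one open descriptor per file, so a real design would have
to bound that somehow.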

Even if we can somehow convince the kernel folks to add a new interface for
us that reports I/O errors to some listener, like an
inotify/fanotify/dnotify/whatever-it-is-today-notify extension reporting
errors in buffered async writes, we won't be able to rely on having it for
5-10 years, and even then only on Linux.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services