Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Anthony Iliopoulos <ailiop(at)altatus(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Craig Ringer <craig(at)2ndquadrant(dot)com>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-03 21:47:01
Message-ID: CA+TgmoZJsNt_57LgmDd7y+TLOQL+36VT9x=JMda+sYuCoJ4aDw@mail.gmail.com
Lists: pgsql-hackers

On Tue, Apr 3, 2018 at 6:35 AM, Anthony Iliopoulos <ailiop(at)altatus(dot)com> wrote:
>> Like other people here, I think this is 100% unreasonable, starting
>> with "the dirty pages which cannot been written out are practically
>> thrown away". Who decided that was OK, and on the basis of what
>> wording in what specification? I think it's always unreasonable to
>
> If you insist on strict conformance to POSIX, indeed the linux
> glibc configuration and associated manpage are probably wrong in
> stating that _POSIX_SYNCHRONIZED_IO is supported. The implementation
> matches the flexibility allowed by not supporting SIO.
> There's a long history of brokenness between linux and posix,
> and I think there was never an intention of conforming to the
> standard.

Well, then the man page probably shouldn't say CONFORMING TO 4.3BSD,
POSIX.1-2001, which it did on the first system I tested. Also, the
current summary, "fsync, fdatasync - synchronize a file's in-core
state with storage device", should be amended by adding ", possibly by
randomly undoing some of the changes you think you made to the
file".

> I believe (as I tried to explain earlier) there is a certain assumption
> being made that the writer and original owner of data is responsible
> for dealing with potential errors in order to avoid data loss (which
> should be only of interest to the original writer anyway). It would
> be very questionable for the interface to persist the error while
> subsequent writes and fsyncs to different offsets may as well go through.

No, that's not questionable at all. fsync() doesn't take any argument
saying which part of the file you care about, so the kernel is in no
way entitled to assume it knows which writes a given fsync() call was
intended to apply to.
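
To make that concrete, the caller has no way to scope the request to a
byte range. A trivial sketch (illustrative only, made-up file name, not
anything from our tree):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("datafile", O_WRONLY | O_CREAT, 0644);

        if (fd < 0) { perror("open"); return 1; }

        /* Two writes to unrelated regions of the same file. */
        pwrite(fd, "AAAA", 4, 0);
        pwrite(fd, "BBBB", 4, 1048576);

        /*
         * fsync() takes only the descriptor.  There is no way to say
         * "I only care about the bytes at offset 0", so the kernel
         * cannot know which of the earlier writes this call is meant
         * to cover.
         */
        if (fsync(fd) != 0)
            perror("fsync");

        close(fd);
        return 0;
    }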

> Another process may need to write into the file and fsync; being
> unaware of those newly introduced semantics, it is now faced with EIO
> because some unrelated previous process failed some earlier writes
> and did not bother to clear the error for those writes. In a similar
> scenario where the second process is aware of the new semantics, it would
> naturally go ahead and clear the global error in order to proceed
> with its own write()+fsync(), which would essentially amount to the
> same problematic semantics you have now.

I don't deny that somebody could have an application which is utterly
indifferent to the fact that earlier modifications to a file failed
due to I/O errors, and is A-OK with that as long as later
modifications can be flushed to disk, but I don't think that's a
normal thing to want.
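
Stated as code, the check-and-clear semantics being defended look
roughly like this (a sketch of the failure path; on a healthy disk both
calls simply succeed, and the comments describe what happens when
writeback fails in between):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int  fd = open("datafile", O_WRONLY | O_CREAT, 0644);
        char buf[8192] = {0};

        if (fd < 0) { perror("open"); return 1; }

        /* Dirty some pages in the page cache. */
        if (write(fd, buf, sizeof buf) < 0)
            perror("write");

        /*
         * If writeback of those pages fails before this point, this
         * fsync() reports EIO -- and, under the semantics discussed
         * here, also clears the error and drops the dirty pages.
         */
        if (fsync(fd) != 0)
            perror("first fsync");

        /*
         * A retry then returns 0, even though nothing was re-written
         * and the data never reached stable storage.
         */
        if (fsync(fd) == 0)
            printf("second fsync reported success\n");

        close(fd);
        return 0;
    }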

>> Also, this really does make it impossible to write reliable programs.
>> Imagine that, while the server is running, somebody runs a program
>> which opens a file in the data directory, calls fsync() on it, and
>> closes it. If the fsync() fails, postgres is now borked and has no
>> way of being aware of the problem. If we knew, we could PANIC, but
>> we'll never find out, because the unrelated process ate the error.
>> This is exactly the sort of ill-considered behavior that makes fcntl()
>> locking nearly useless.
>
> Fully agree, and the errseq_t fixes have dealt exactly with the issue
> of making sure that the error is reported to all file descriptors that
> *happen to be open at the time of error*.

Well, in PostgreSQL, we have a background process called the
checkpointer which is the process that normally does all of the
fsync() calls but only a subset of the write() calls. The
checkpointer does not, however, necessarily have every file open all
the time, so these fixes aren't sufficient to make sure that the
checkpointer ever sees an fsync() failure. What you have (or someone
has) basically done here is made an undocumented assumption about
which file descriptors might care about a particular error, but it
just so happens that PostgreSQL has never conformed to that
assumption. You can keep on saying the problem is with our
assumptions, but it doesn't seem like a very good guess to me to
suppose that we're the only program that has ever made them. The
documentation for fsync() gives zero indication that it's
edge-triggered, and so complaining that people wouldn't like it if it
became level-triggered seems like an ex post facto justification for a
poorly-chosen behavior: they probably think (as we did prior to a week
ago) that it already is.
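
Reduced to a sketch (process boundaries elided, file name made up, and
of course not our actual code), the pattern PostgreSQL relies on looks
like this:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /*
     * One process writes and closes; a different process, which may
     * not have had the file open when writeback failed, later opens
     * it and calls fsync().
     */

    static void backend_writes(const char *path)
    {
        int  fd = open(path, O_WRONLY | O_CREAT, 0644);
        char block[8192] = {0};

        if (write(fd, block, sizeof block) < 0)  /* evicted buffer handed to the OS */
            perror("write");
        close(fd);                               /* no fsync() here */
    }

    static int checkpointer_syncs(const char *path)
    {
        /*
         * Opened only now.  If writeback errors are reported only to
         * descriptors that were open at the time of the error, an
         * earlier failure is invisible here and fsync() returns 0.
         */
        int fd = open(path, O_WRONLY);
        int rc;

        if (fd < 0)
            return -1;
        rc = fsync(fd);
        close(fd);
        return rc;
    }

    int main(void)
    {
        backend_writes("relation-segment");

        if (checkpointer_syncs("relation-segment") != 0)
            fprintf(stderr, "checkpoint would have to PANIC here\n");
        return 0;
    }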

> Not sure I understand this case. The application may indeed re-write
> a bunch of pages that have failed and proceed with fsync(). The kernel
> will deal with combining the writeback of all the re-written pages. But
> further the necessity of combining for performance really depends on
> the exact storage medium. At the point you start caring about
> write-combining, the kernel community will naturally redirect you to
> use DIRECT_IO.

Well, the way PostgreSQL works today, we typically run with, say, 8GB
of shared_buffers even if the system memory is, say, 200GB. As pages are
evicted from our relatively small cache to the operating system, we
track which files need to be fsync()'d at checkpoint time, but we
don't hold onto the blocks. Until checkpoint time, the operating
system is left to decide whether it's better to keep caching the dirty
blocks (thus leaving less memory for other things, but possibly
allowing write-combining if the blocks are written again) or whether
it should clean them to make room for other things. This means that
only a small portion of the operating system memory is directly
managed by PostgreSQL, while allowing the effective size of our cache
to balloon to some very large number if the system isn't under heavy
memory pressure.
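
Stripped of all detail, that bookkeeping amounts to something like the
following (an illustrative sketch only, with made-up names, not the
actual fsync-request machinery):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define MAX_PENDING 1024

    static const char *pending[MAX_PENDING];
    static int         npending;

    /* Remember that a file has dirty data the OS is now responsible for. */
    static void remember_dirty_file(const char *path)
    {
        for (int i = 0; i < npending; i++)
            if (strcmp(pending[i], path) == 0)
                return;                     /* already queued */
        pending[npending++] = path;
    }

    /* Evict a buffer: hand the block to the OS cache, defer durability. */
    static void evict_buffer(const char *path, off_t off,
                             const void *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT, 0644);

        if (fd < 0 || pwrite(fd, buf, len, off) < 0)
            perror(path);
        if (fd >= 0)
            close(fd);
        remember_dirty_file(path);
    }

    /* At checkpoint time, fsync() every file written since the last one. */
    static void checkpoint(void)
    {
        for (int i = 0; i < npending; i++)
        {
            int fd = open(pending[i], O_WRONLY);

            if (fd < 0 || fsync(fd) != 0)   /* a failure here must be fatal */
                fprintf(stderr, "fsync of %s failed\n", pending[i]);
            if (fd >= 0)
                close(fd);
        }
        npending = 0;
    }

    int main(void)
    {
        char block[8192] = {0};

        evict_buffer("segment.0", 0, block, sizeof block);
        checkpoint();
        return 0;
    }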

Now, I hear the DIRECT_IO thing and I assume we're eventually going to
have to go that way: Linux kernel developers seem to think that "real
men use O_DIRECT" and so if other forms of I/O don't provide useful
guarantees, well that's our fault for not using O_DIRECT. That's a
political reason, not a technical reason, but it's a reason all the
same.
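
Just to illustrate what that switch entails at the syscall level (a
minimal sketch; the 4096-byte alignment is an assumption, and the real
requirements vary by filesystem and device):

    #define _GNU_SOURCE             /* O_DIRECT is a Linux extension */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* O_DIRECT bypasses the kernel page cache entirely. */
        int fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT, 0644);

        if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

        /*
         * The application must now supply suitably aligned buffers and
         * do its own caching, scheduling, and write-combining -- the
         * work the kernel page cache does for buffered I/O today.
         */
        void *buf;

        if (posix_memalign(&buf, 4096, 8192) != 0)
        {
            fprintf(stderr, "posix_memalign failed\n");
            return 1;
        }
        memset(buf, 0, 8192);

        if (pwrite(fd, buf, 8192, 0) < 0)
            perror("pwrite");

        free(buf);
        close(fd);
        return 0;
    }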

Unfortunately, that is going to add a huge amount of complexity,
because if we ran with shared_buffers set to a large percentage of
system memory, we couldn't allocate large chunks of memory for sorts
and hash tables from the operating system any more. We'd have to
allocate it from our own shared_buffers because that's basically all
the memory there is and using substantially more might run the system
out entirely. So it's a huge, huge architectural change. And even
once it's done it is in some ways inferior to what we are doing today
-- true, it gives us superior control over writeback timing, but it
also makes PostgreSQL play less nicely with other things running on
the same machine, because now PostgreSQL has a dedicated chunk of
whatever size it has, rather than using some portion of the OS buffer
cache that can grow and shrink according to memory needs both of other
parts of PostgreSQL and other applications on the system.

> I suppose that if the check-and-clear semantics are problematic for
> Pg, one could suggest a kernel patch that opts in to a level-triggered
> reporting of fsync() on a per-descriptor basis, which seems to be
> non-intrusive and probably sufficient to cover your expected use-case.

That would certainly be better than nothing.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
