Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-03 14:29:10
Message-ID: CAMsr+YHSj6g3bqzDH90-O8wHd9Eu=+1efjeA13RD+W4JMUOuBw@mail.gmail.com
Lists: pgsql-hackers

On 3 April 2018 at 10:54, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:

> I think it's always unreasonable to
> throw away the user's data.

Well, we do that. If a txn aborts, all writes in the txn are discarded.

I think that's perfectly reasonable. And though we promise an
all-or-nothing effect, we make exceptions even there.

The FS doesn't offer transactional semantics, but the fsync behaviour can
be interpreted kind of similarly.

I don't *agree* with it, but I don't think it's as wholly unreasonable as
all that. I think leaving it undocumented is absolutely gobsmacking, and
it's dubious at best, but it's not totally insane.

> If the writes are going to fail, then let
> them keep on failing every time.

Like we do, where we require an explicit rollback.

But POSIX may pose issues there; it doesn't really define any interface for
that, AFAIK. Unless you expect the app to close() and re-open() the file.
Replacing one nonstandard behaviour with another may not be a win.
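To make the close-and-reopen idea concrete, here is a minimal sketch (Python for brevity; the helper name `durable_write` is mine, not anything in PostgreSQL or POSIX) that treats any fsync() failure as invalidating the descriptor outright, since retrying fsync() on the same fd cannot be trusted:

```python
import os

def durable_write(path, data):
    """Append data to path and fsync it. Any fsync() failure is treated
    as fatal for this file descriptor: close it and report failure, so
    the caller can re-open and redo the write from its own copy.
    Never retry fsync() on the same fd after a failure."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        while data:                # handle short writes
            n = os.write(fd, data)
            data = data[n:]
        os.fsync(fd)               # the one and only fsync attempt
    except OSError:
        os.close(fd)
        return False               # caller must re-open and replay
    os.close(fd)
    return True
```

The reason the sketch never retries on the same descriptor is exactly the behaviour under discussion: on the kernels in question a second fsync() after a failure may report success even though the dirty pages were dropped.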

> *That* wouldn't cause any data loss,
> because we'd never be able to checkpoint, and eventually the user
> would have to kill the server uncleanly, and that would trigger
> recovery.
>

Yep. That's what I expected to happen on unrecoverable I/O errors. Because,
y'know, unrecoverable.

I was stunned to learn it's not so. And I'm even more amazed to learn that
ext4's errors=remount-ro apparently doesn't concern itself with mere user
data, and may exhibit the same behaviour - I need to rerun my test case on
it tomorrow.

> Also, this really does make it impossible to write reliable programs.
>

In the presence of multiple apps interacting on the same file, yes. I think
that's a little bit of a stretch though.

For a single app, you can recover by remembering and redoing all the writes
you did.

Sucks if your app wants to have multiple processes working together on a
file without some kind of journal or WAL, relying on fsync() alone, mind
you. But at least we have WAL.
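The remember-and-redo recovery for a single app can be sketched as an application-side redo buffer (Python; purely illustrative, and deliberately much cruder than a real WAL): keep a copy of every write until fsync() confirms it, and on failure replay the lot onto a fresh descriptor.

```python
import os

class RedoBuffer:
    """Keep an in-memory copy of writes until fsync() confirms them,
    so an fsync() failure can be recovered by replaying every
    unconfirmed write onto a fresh file descriptor."""

    def __init__(self, path):
        self.path = path
        self.pending = []              # (offset, bytes) not yet durable

    def write(self, fd, offset, data):
        os.pwrite(fd, data, offset)
        self.pending.append((offset, data))

    def sync(self, fd):
        try:
            os.fsync(fd)
        except OSError:
            return False               # caller: close fd, then replay()
        self.pending.clear()           # durable; safe to forget
        return True

    def replay(self):
        """Re-open the file and redo all unconfirmed writes."""
        fd = os.open(self.path, os.O_WRONLY | os.O_CREAT, 0o600)
        for offset, data in self.pending:
            os.pwrite(fd, data, offset)
        return fd                      # caller should sync() again
```

Note this only works because one process owns all the writes; with multiple writers on the same file, no process has a complete redo buffer, which is the multi-app problem above.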

Hrm. I wonder how this interacts with wal_level=minimal.

> Even leaving that aside, a PANIC means a prolonged outage on a
> production system - it could easily take tens of minutes or longer to
> run recovery. So saying "oh, just do that" is not really an answer.
> Sure, we can do it, but it's like trying to lose weight by
> intentionally eating a tapeworm. Now, it's possible to shorten the
> checkpoint_timeout so that recovery runs faster, but then performance
> drops because data has to be fsync()'d more often instead of getting
> buffered in the OS cache for the maximum possible time.

It's also spikier. Users have more issues with latency with short, frequent
checkpoints.

> We could also
> dodge this issue in another way: suppose that when we write a page
> out, we don't consider it really written until fsync() succeeds. Then
> we wouldn't need to PANIC if an fsync() fails; we could just re-write
> the page. Unfortunately, this would also be terrible for performance,
> for pretty much the same reasons: letting the OS cache absorb lots of
> dirty blocks and do write-combining is *necessary* for good
> performance.
>

Our double-caching is already plenty bad enough as it is.

(Ideally I want to be able to swap buffers between shared_buffers and the
OS buffer-cache. Almost like a 2nd level of buffer pinning. When we write
out a block, we *transfer* ownership to the OS. Yeah, I'm dreaming. But
we'd sure need to be able to trust the OS not to just forget the block
then!)

> > The error reporting is thus consistent with the intended semantics (which
> > are sadly not properly documented). Repeated calls to fsync() simply do not
> > imply that the kernel will retry to writeback the previously-failed pages,
> > so the application needs to be aware of that. Persisting the error at the
> > fsync() level would essentially mean moving application policy into the
> > kernel.
>
> I might accept this argument if I accepted that it was OK to decide
> that an fsync() failure means you can forget that the write() ever
> happened in the first place, but it's hard to imagine an application
> that wants that behavior. If the application didn't care about
> whether the bytes really got to disk or not, it would not have called
> fsync() in the first place. If it does care, reporting the error only
> once is never an improvement.
>

Many RDBMSes do just that. It's hardly behaviour unique to the kernel. They
report an ERROR on a statement in a txn then go on with life, merrily
forgetting that anything was ever wrong.

I agree with PostgreSQL's stance that this is wrong. We require an explicit
rollback (or ROLLBACK TO SAVEPOINT) to restore the session to a usable
state. This is good.

But we're the odd one out. Almost everyone else does much what fsync() does
on Linux: report the error and forget it.
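To make that contrast concrete, here's a toy model (Python; a simulation of the semantics being described, not the kernel's actual code) of report-once-and-forget: the first fsync() after a failed writeback raises, and the very next fsync() reports success even though the data never reached disk.

```python
class ReportOnceFile:
    """Toy model of the fsync() error semantics under discussion: a
    writeback error is reported to exactly one fsync() call and then
    cleared, even though the dirty pages were dropped and the data
    never made it to disk."""

    def __init__(self):
        self.dirty = []            # buffered writes not yet "on disk"
        self.on_disk = []
        self.io_error = False      # flip to simulate a failing device

    def write(self, data):
        self.dirty.append(data)    # write() just dirties the page cache

    def fsync(self):
        if self.io_error:
            self.dirty.clear()     # the pages are dropped...
            self.io_error = False  # ...and the error is cleared once reported
            raise OSError("EIO")
        self.on_disk.extend(self.dirty)
        self.dirty.clear()
```

A caller that naively retries sees the surprise: the first fsync() raises, the retry "succeeds", and `on_disk` is still empty - a successful fsync() after a failed one proves nothing about the data.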

In any case, we're not going to get anyone to backpatch a fix for this into
all kernels, so we're stuck working around it.

I'll do some testing with ENOSPC tomorrow, propose a patch, report back.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
