Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS

From: Craig Ringer <craig(at)2ndquadrant(dot)com>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Andrew Gierth <andrew(at)tao11(dot)riddles(dot)org(dot)uk>, Bruce Momjian <bruce(at)momjian(dot)us>, Robert Haas <robertmhaas(at)gmail(dot)com>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-08 03:27:45
Message-ID: CAMsr+YEeTrvp9wVnQTpOszaiecCfy6iA6SS0ADyPcyMOSA8HXQ@mail.gmail.com
Lists: pgsql-hackers

On 8 April 2018 at 10:16, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
wrote:

> So, what can we actually do about this new Linux behaviour?
>

Yeah, I've been mulling that over myself.

More below, but here's idea #5: decide InnoDB has the right idea, and switch
to using a single massive blob file, or a few giant blobs.

We have a storage abstraction that makes this way, way less painful than it
otherwise would be.

We can virtualize relfilenodes into storage extents in relatively few big
files. We could use sparse regions to make the addressing more convenient,
but that makes copying and backup painful, so I'd rather not.

Even one file per tablespace for persistent relation heaps, another for
indexes, and another for each fork type would be enough.

That way we could use something like your #1 (which is what I was also
considering, then rejecting, previously), but reduce the pain by cutting the
FD count drastically, so exhausting FDs stops being a problem.
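
To be concrete about what I mean by virtualizing relfilenodes, the storage
layer would do something like the mapping below. This is purely a sketch; the
names, the extent size, and the single shared fd are all invented for
illustration.

#include <sys/types.h>

#define BLCKSZ          8192        /* PostgreSQL's default block size */
#define EXTENT_BLOCKS   16384       /* 128 MB extents, say */

static int heap_bigfile_fd;         /* one shared fd for all heap extents */

typedef struct BigFileAddr
{
    int     fd;                     /* which big file */
    off_t   offset;                 /* byte offset within it */
} BigFileAddr;

static BigFileAddr
map_heap_block(unsigned extent_no, unsigned block_no)
{
    BigFileAddr addr;

    /* every relation owns one or more extents in the shared heap file */
    addr.fd = heap_bigfile_fd;
    addr.offset = ((off_t) extent_no * EXTENT_BLOCKS + block_no) * BLCKSZ;
    return addr;
}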

Previously I was leaning toward what you've described here:

> * whenever you open a file, either tell the checkpointer so it can
> open it too (and wait for it to tell you that it has done so, because
> it's not safe to write() until then), or send it a copy of the file
> descriptor via IPC (since duplicated file descriptors share the same
> f_wb_err)
>
> * if the checkpointer can't take any more file descriptors (how would
> that limit even work in the IPC case?), then it somehow needs to tell
> you that so that you know that you're responsible for fsyncing that
> file yourself, both on close (due to fd cache recycling) and also when
> the checkpointer tells you to
>
> Maybe it could be made to work, but sheesh, that seems horrible. Is
> there some simpler idea along these lines that could make sure that
> fsync() is only ever called on file descriptors that were opened
> before all unflushed writes, or file descriptors cloned from such file
> descriptors?
>

... and got stuck on "yuck, that's awful".

I was assuming we'd force early checkpoints if the checkpointer hit its fd
limit, but that's even worse.
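
FWIW, the descriptor-passing mechanics themselves are simple enough. A rough
sketch (names invented) of handing the checkpointer a dup of our fd over a
Unix domain socket; the receiver's copy refers to the same open file
description, so it shares the same f_wb_err:

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send "fd" to the checkpointer over an already-connected Unix socket. */
static int
send_fd_to_checkpointer(int sock, int fd)
{
    struct msghdr   msg;
    struct cmsghdr *cmsg;
    struct iovec    iov;
    char            cbuf[CMSG_SPACE(sizeof(int))];
    char            byte = 0;

    memset(&msg, 0, sizeof(msg));
    memset(cbuf, 0, sizeof(cbuf));

    iov.iov_base = &byte;           /* must carry at least one data byte */
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;   /* kernel dups the fd for the receiver */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

The hard part isn't the plumbing, it's the bookkeeping, the waiting, and the
fd limits around it.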

We'd urgently need to do away with segmented relations, and partitions would
start to become a hindrance.

Even then it's going to be an unworkable nightmare with heavily partitioned
systems, systems that use schema-sharding, etc. And it'll mean we need to
play with process limits and, often, system-wide limits on FDs. I imagine
the performance implications won't be pretty.

> Idea 2:
>
> Give up, complain that this implementation is defective and
> unworkable, both on POSIX-compliance grounds and on POLA grounds, and
> campaign to get it fixed more fundamentally (actual details left to
> the experts, no point in speculating here, but we've seen a few
> approaches that work on other operating systems including keeping
> buffers dirty and marking the whole filesystem broken/read-only).
>

This appears to be what SQLite does AFAICS.

https://www.sqlite.org/atomiccommit.html

though it has the huge luxury of a single writer, so it's probably only
subject to the original issue, not the multiprocess / checkpointer issues we
face.

> Idea 3:
>
> Give up on buffered IO and develop an O_SYNC | O_DIRECT based system ASAP.
>

That seems to be what the kernel folks will expect. But that's going to
KILL performance. We'll need writer threads to have any hope of it not
*totally* sucking, because otherwise simple things like updating a heap
tuple and two related indexes will incur enormous disk latencies.
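
To spell out what that write path looks like, here's a Linux-flavoured sketch
(error handling trimmed, names mine). Every block write becomes a synchronous,
aligned, direct write that waits on the device before returning.

#define _GNU_SOURCE                 /* for O_DIRECT on Linux */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLCKSZ 8192

int
write_block_direct(const char *path, const void *block, off_t offset)
{
    void   *buf;
    int     fd;
    int     rc = -1;

    /* O_DIRECT needs the buffer (and offset/length) suitably aligned */
    if (posix_memalign(&buf, 4096, BLCKSZ) != 0)
        return -1;
    memcpy(buf, block, BLCKSZ);

    fd = open(path, O_WRONLY | O_DIRECT | O_SYNC);
    if (fd >= 0)
    {
        /* returns only once the data has reached stable storage */
        if (pwrite(fd, buf, BLCKSZ, offset) == BLCKSZ)
            rc = 0;
        close(fd);
    }
    free(buf);
    return rc;
}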

But I suspect it's the path forward.

Goody.

> Any other ideas?
>
> For a while I considered suggesting an idea which I now think doesn't
> work. I thought we could try asking for a new fcntl interface that
> spits out wb_err counter. Call it an opaque error token or something.
> Then we could store it in our fsync queue and safely close the file.
> Check again before fsync()ing, and if we ever see a different value,
> PANIC because it means a writeback error happened while we weren't
> looking. Sadly I think it doesn't work because AIUI inodes are not
> pinned in kernel memory when no one has the file open and there are no
> dirty buffers, so I think the counters could go away and be reset.
> Perhaps you could keep inodes pinned by keeping the associated buffers
> dirty after an error (like FreeBSD), but if you did that you'd have
> solved the problem already and wouldn't really need the wb_err system
> at all. Is there some other idea along these lines that could work?

I think our underlying data syncing concept is fundamentally broken, and
it's not really the kernel's fault.

We assume that we can safely:

procA: open()
procA: write()
procA: close()

... some long time later, unbounded as far as the kernel is concerned ...

procB: open()
procB: fsync()
procB: close()
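
Spelled out as code, the assumption looks roughly like this (sketch only,
error handling omitted):

#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* procA: dirty some pages and walk away */
void
proc_a(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY);

    write(fd, buf, len);
    close(fd);                      /* writeback left entirely to the kernel */
}

/* procB: much later, a different process tries to make it durable */
int
proc_b(const char *path)
{
    int fd = open(path, O_RDWR);    /* brand new file descriptor */
    int rc = fsync(fd);             /* we assume this reports any writeback
                                     * failure since procA's write(); that's
                                     * exactly the assumption in question */

    close(fd);
    return rc;
}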

If the kernel does writeback in the middle, how on earth is it supposed to
know we expect to reopen the file and check back later?

Should it just remember "this file had an error" forever, and tell every
caller? In that case how could we recover? We'd need some new API to say
"yeah, ok already, I'm redoing all my work since the last good fsync() so
you can clear the error flag now". Otherwise it'd keep reporting an error
after we did redo to recover, too.

It never really clicked for me that we close relations with pending buffered
writes, leave them closed, then reopen them to fsync. That's ... well, the
kernel isn't the only one doing crazy things here.

Right now I think we're at option (4): if you see anything that smells like
a write error in your kernel logs, hard-kill postgres with -m immediate (do
NOT let it do a shutdown checkpoint). If it has completed a checkpoint since
the error appeared in the logs, fake up a backup label to force redo to start
from the last checkpoint before the error. Otherwise, it's safe to just let
it start up again and do redo.

Fun times.

This also means, AFAICS, that running Pg on NFS is extremely unsafe: you MUST
make sure you don't run out of disk, because the usual safeguard of space
reservation against ENOSPC in fsync() doesn't apply to NFS. (I haven't tested
this with NFSv3 in sync,hard,nointr mode yet; *maybe* that's safe, but I
doubt it.) The same applies to thin-provisioned storage. Just. Don't.

This helps explain various reports of corruption with Docker and other tools
that use thin provisioning of one sort or another. If you hit ENOSPC in
fsync(), bye-bye data.

--
Craig Ringer http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
