From: Bruce Momjian <bruce(at)momjian(dot)us>
To: Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Anthony Iliopoulos <ailiop(at)altatus(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Catalin Iacob <iacobcatalin(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
Date: 2018-04-04 13:49:38
Message-ID: 20180404134938.GD25202@momjian.us
Lists: pgsql-hackers
On Wed, Apr 4, 2018 at 07:32:04PM +1200, Thomas Munro wrote:
> On Wed, Apr 4, 2018 at 6:00 PM, Craig Ringer <craig(at)2ndquadrant(dot)com> wrote:
> > On 4 April 2018 at 13:29, Thomas Munro <thomas(dot)munro(at)enterprisedb(dot)com>
> > wrote:
> >> /* Ensure that we skip any errors that predate opening of the file */
> >> f->f_wb_err = filemap_sample_wb_err(f->f_mapping);
> >>
> >> [...]
> >
> > Holy hell. So even PANICing on fsync() isn't sufficient, because the kernel
> > will deliberately hide writeback errors that predate our fsync() call from
> > us?
>
> Predates the opening of the file by the process that calls fsync().
> Yeah, it sure looks that way based on the above code fragment. Does
> anyone know better?
Uh, just to clarify, what is new here is that it is ignoring any
_errors_ that happened before the open(). It is not ignoring write()s
that happened but had not been flushed to storage before the open().
FYI, pg_test_fsync has always tested the ability to fsync() write()s
from other processes:
Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
write, fsync, close 5360.341 ops/sec 187 usecs/op
write, close, fsync 4785.240 ops/sec 209 usecs/op
Those two numbers should be similar. I added this as a check to make
sure the behavior we were relying on was working. I never tested sync
errors though.
I think the fundamental issue is that we always assumed that writes to
the kernel that could not be written to storage would remain in the
kernel until they succeeded, and that fsync() would report their
existence.
I can understand why kernel developers don't want to keep failed sync
buffers in memory, and once they are gone we lose reporting of their
failure.  Also, if the kernel is not going to retry the syncs, how long
should it keep reporting the sync failure?  Until the first fsync()
that happens after the failure?  How long should it continue to record
the failure?  What if no fsync() ever happens, which is likely for
non-Postgres workloads?  I think once they decided to discard failed
syncs and not retry them, the fsync() behavior we are complaining about
was almost required.
Our only option might be to tell administrators to closely watch for
kernel write failure messages, and then restore or failover. :-(
The last time I remember being this surprised about storage was in the
early Postgres years when we learned that just because the BSD file
system uses 8k pages doesn't mean those are atomically written to
storage. We knew the operating system wrote the data in 8k chunks to
storage but:
o the 8k pages are written as separate 512-byte sectors
o the 8k might be contiguous logically on the drive but not physically
o even 512-byte sectors are not written atomically
This is why we write full page images to WAL before modifying pages,
which is what full_page_writes controls.
--
Bruce Momjian <bruce(at)momjian(dot)us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +