From: Daniel Farina <daniel(at)heroku(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: fsync reliability
Date: 2011-04-25 02:06:06
Message-ID: BANLkTinr8+ntSmRZMKZKMFMqiCbX_tqBhg@mail.gmail.com
Lists: pgsql-hackers
On Thu, Apr 21, 2011 at 8:51 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> There's still the "fsync'd a data block but not the directory entry yet"
> issue as fall-out from this too. Why doesn't PostgreSQL run into this
> problem? Because the exact code sequence used is this one:
>
> open
> write
> fsync
> close
>
> And Linux shouldn't ever screw that up, or the similar rename path. Here's
> what the close man page says, from http://linux.die.net/man/2/close :
Theodore Ts'o addresses this *exact* sequence of events, and says that
if you want the rename itself to definitely stick, you must fsync the
containing directory:
http://www.linuxfoundation.org/news-media/blogs/browse/2009/03/don%E2%80%99t-fear-fsync
"""
One argument that has commonly been made on the various comment
streams is that when replacing a file by writing a new file and the
then renaming “file.new” to “file”, most applications don’t need a
guarantee that new contents of the file are committed to stable store
at a certain point in time; only that either the new or the old
contents of the file will be present on the disk. So the argument is
essentially that the sequence:
fd = open("foo.new", O_WRONLY);
write(fd, buf, bufsize);
fsync(fd);
close(fd);
rename("foo.new", "foo");
… is too expensive, since it provides “atomicity and durability”, when
in fact all the application needed was “atomicity” (i.e., either the
new or the old contents of foo should be present after a crash), but
not durability (i.e., the application doesn’t need to have the new
version of foo now, but rather at some intermediate time in the future
when it’s convenient for the OS).
This argument is flawed for two reasons. First of all, the sequence
above provides exactly the desired “atomicity without durability”. It
doesn’t guarantee which version of the file will appear in the event
of an unexpected crash; if the application needs a guarantee that the
new version of the file will be present after a crash, ***it’s
necessary to fsync the containing directory***
"""
Emphasis mine.
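To make that concrete, here's a minimal sketch of the replace-by-rename
pattern with the extra directory fsync Ts'o describes. It assumes a
POSIX system, omits error handling, and the buffer and path names are
purely illustrative:

    /* Write the new contents and make them durable. */
    int fd = open("foo.new", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd, buf, bufsize);
    fsync(fd);
    close(fd);

    /* Atomically replace the old file with the new one. */
    rename("foo.new", "foo");

    /* Make the rename itself durable by fsyncing the containing
     * directory; without this, a crash can still leave the old
     * directory entry in place even though the data blocks for
     * foo.new were flushed. */
    int dirfd = open(".", O_RDONLY);
    fsync(dirfd);
    close(dirfd);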
So, all in all, I think creating, deleting, or renaming files in the
write-ahead log area should be followed by an fsync of the pg_xlog
directory. I think it is also necessary to fsync directories in the
cluster directory at checkpoint time: if a chunk of directory
metadata doesn't make it to disk, a checkpoint occurs, and then
there's a crash, it's possible that replaying the WAL from after the
checkpoint won't re-create/move/delete the file in the cluster.
The fact that this hasn't been happening (or hasn't triggered an
error, which would be scarier) may just be a happy accident of that
metadata getting flushed most of the time anyway, which would also
mean that an fsync() on the directory file descriptor won't cost very
much in practice.
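For illustration, I'm imagining something along these lines after a
WAL segment is created, removed, or installed under a new name.
fsync_dir() is a hypothetical helper (nothing like it exists in the
backend today), and oldpath/newpath are placeholders:

    /* Hypothetical helper: flush directory metadata (new, removed,
     * or renamed entries) for the given path to stable storage. */
    static void
    fsync_dir(const char *path)
    {
        int fd = open(path, O_RDONLY);

        if (fd < 0)
            elog(ERROR, "could not open directory \"%s\": %m", path);
        if (fsync(fd) != 0)
            elog(ERROR, "could not fsync directory \"%s\": %m", path);
        close(fd);
    }

    /* e.g. after installing a recycled segment under a new name: */
    rename(oldpath, newpath);
    fsync_dir("pg_xlog");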
--
fdr