Re: fsync reliability

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fsync reliability
Date: 2011-04-22 11:41:21
Message-ID: BANLkTi=bCJGR522C_OizbB3KMpc4gqNP-w@mail.gmail.com
Lists: pgsql-hackers

On Fri, Apr 22, 2011 at 4:51 AM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> On 04/21/2011 04:26 AM, Simon Riggs wrote:
>>
>> However, that raises the question of what happens with WAL. At present,
>> we do nothing to ensure that "the entry in the directory containing
>> the file has also reached disk".
>>
>
> Well, we do, but it's not obvious why that is unless you've stared at this
> for far too many hours.  A clear description of the possible issue you and
> Dan are raising showed up on LKML a few years ago:
>  http://lwn.net/Articles/270891/
>
> Here's the most relevant part, which directly addresses the WAL case:
>
> "[fsync] is unsafe for write-ahead logging, because it doesn't really
> guarantee any _ordering_ for the writes at the hard storage level.  So aside
> from losing committed data, it can also corrupt structural metadata.  With
> ext3 it's quite easy to verify that fsync/fdatasync don't always write a
> journal entry.  (Apart from looking at the kernel code :-)
>
> Just write some data, fsync(), and observe the number of writes in
> /proc/diskstats.  If the current mtime second _hasn't_ changed, the inode
> isn't written.  If you write data, say, 10 times a second to the same place
> followed by fsync(), you'll see a little more than 10 write I/Os, and less
> than 20."
>
> There's a terrible hack suggested where you run fchmod to force the journal
> out in the next fsync that makes me want to track the poster down and shoot
> him, but this part raises a reasonable question.
>
> The main issue he's complaining about here is a moot one for PostgreSQL.  If
> the WAL rewrites have been reordered but have not completed, the minute WAL
> replay hits the spot with a missing block the CRC32 will be busted and
> replay is finished.  The fact that he's assuming a database would have such
> a naive WAL implementation that it would corrupt the database if blocks are
> written out of order before the fsync call returns is one of the reasons
> this whole idea never got more traction--hard to get excited about a
> proposal whose fundamentals rest on an assumption that doesn't turn out to
> be true on real databases.
>
> There's still the "fsync'd a data block but not the directory entry yet"
> issue as fall-out from this too.  Why doesn't PostgreSQL run into this
> problem?  Because the exact code sequence used is this one:
>
> open
> write
> fsync
> close
>
> And Linux shouldn't ever screw that up, or the similar rename path.  Here's
> what the close man page says, from http://linux.die.net/man/2/close :
>
> "A successful close does not guarantee that the data has been successfully
> saved to disk, as the kernel defers writes. It is not common for a
> filesystem to flush the buffers when the stream is closed. If you need to be
> sure that the data is physically stored use fsync(2). (It will depend on the
> disk hardware at this point.)"
>
> What this is alluding to is that if you fsync before closing, the close will
> write all the metadata out too.  You're busted if your write cache lies, but
> we already know all about that issue.
>
> There was a discussion of issues around this on LKML a few years ago, with
> Alan Cox getting the good pull quote at http://lkml.org/lkml/2009/3/27/268 :
> "fsync/close() as a pair allows the user to correctly indicate their
> requirements."  While fsync doesn't guarantee that metadata is written out,
> and neither does close, kernel developers seem to all agree that
> fsync-before-close means you want everything on disk.  Filesystems that
> don't honor that will break all sorts of software.
>
> It is of course possible there are bugs in some part of this code path,
> where a clever enough test case might expose a window of strange
> file/metadata ordering.  I think it's too weak of a theorized problem to go
> specifically chasing after though.

We do issue fsync and then close, but only when we switch log files.
We don't do that as part of the normal commit path.
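For what it's worth, the log-switch path reduces to the open/write/fsync/close
sequence quoted above. A minimal sketch of that pattern (a simplified
illustration, not the actual xlog.c code; the function name is made up):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write a buffer and force it to durable storage before closing.
 * On Linux, fsync-before-close is the sequence kernel developers
 * agree should also get the file's metadata out. */
int
write_and_sync(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;
    if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);           /* 0 on success */
}
```

The important detail is the ordering: the fsync happens while the descriptor
is still open, before close, which is the pairing Alan Cox's quote refers to.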

I agree that there isn't a "crash bug" here. If WAL metadata is wrong,
or if WAL data blocks are missing, then this will just show up as an
"end of WAL" condition on crash recovery. Postgres will still work at
the end of it. What worries me is that because recovery always ends on
an error, we have no real way of knowing whether this has never
happened or happens all the time.
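To illustrate why missing or reordered WAL blocks degrade into an "end of
WAL" condition rather than corruption: each record carries a CRC, and replay
stops at the first record whose stored CRC fails to match its payload. A toy
version of that check (the standard CRC-32 here is only a stand-in for the
real WAL CRC; the function names are invented for the example):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320). */
uint32_t
crc32_buf(const unsigned char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++)
    {
        crc ^= buf[i];
        for (int b = 0; b < 8; b++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Replay stops at the first record whose stored CRC doesn't match:
 * a torn or reordered write simply looks like the end of the log. */
int
record_is_valid(const unsigned char *payload, size_t len, uint32_t stored_crc)
{
    return crc32_buf(payload, len) == stored_crc;
}
```

So a write that never reached disk fails this check and recovery ends
cleanly there; the problem is that the same error also ends recovery in the
normal case, which is why the two are indistinguishable.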

Now that I think about it, I can't really see a good reason why we apply
WAL files in sequence trusting just the file name sequence during
crash recovery. The files contain information to allow us to identify
the contents, so if we can't see a file with the right name we can
always scan other files to see if they are the right ones. I would
prefer a WAL file ordering that wasn't dependent at all on file name.
If we did that we wouldn't need to do the file rename thing; we could
just have files called log1, log2, etc. Archiving could still use the
current file names.
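A sketch of what content-based identification might look like, assuming a
fixed header with a magic number and a sequence number at the start of each
segment (the header layout, magic value, and names are hypothetical, not the
real WAL page header):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define WAL_MAGIC 0xD07EFEEDu   /* hypothetical segment identifier */

/* Hypothetical fixed-size segment header. */
typedef struct
{
    uint32_t magic;             /* marks the file as a WAL segment */
    uint64_t seqno;             /* logical position in the WAL stream */
} WalFileHeader;

/* Fill *hdr and return 0 if 'path' starts with a valid header, so
 * recovery could order segments by seqno instead of by file name. */
int
read_wal_header(const char *path, WalFileHeader *hdr)
{
    FILE   *f = fopen(path, "rb");
    size_t  n;

    if (f == NULL)
        return -1;
    n = fread(hdr, 1, sizeof(*hdr), f);
    fclose(f);
    if (n != sizeof(*hdr) || hdr->magic != WAL_MAGIC)
        return -1;
    return 0;
}
```

Recovery would scan the directory, read each candidate's header, and sort by
seqno, making the file name irrelevant to correctness.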

The issue you raise above, that "fsync is not safe for Write Ahead
Logging", doesn't sound good, and I don't think what you've said has
fully addressed it yet. We could replace the commit path with O_DIRECT
and physically order the data blocks, but I would guess the code path
to durable storage has too many layers tweaking it for me to feel
confident that would be worth it.
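For completeness, the O_DIRECT route would look roughly like this: an aligned
buffer, a block-sized write, and still an fsync afterwards for metadata. A
sketch only, with a buffered fallback for filesystems that reject direct I/O
(names and block size are illustrative):

```c
#define _GNU_SOURCE             /* for O_DIRECT on Linux */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK_SIZE 4096         /* O_DIRECT wants block-aligned I/O */

/* Write one block, bypassing the page cache where the filesystem
 * allows it; fall back to a buffered write where it doesn't. */
int
direct_write_block(const char *path, const char *data, size_t len)
{
    void   *buf;
    ssize_t written;
    int     fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0600);

    if (fd < 0 && (errno == EINVAL || errno == EOPNOTSUPP))
        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    /* O_DIRECT requires the user buffer itself to be aligned. */
    if (posix_memalign(&buf, BLOCK_SIZE, BLOCK_SIZE) != 0)
    {
        close(fd);
        return -1;
    }
    memset(buf, 0, BLOCK_SIZE);
    memcpy(buf, data, len < BLOCK_SIZE ? len : BLOCK_SIZE);

    written = write(fd, buf, BLOCK_SIZE);
    if (written < 0 && errno == EINVAL)
    {
        /* Direct write rejected at write time; retry buffered. */
        close(fd);
        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd >= 0)
            written = write(fd, buf, BLOCK_SIZE);
    }
    free(buf);
    if (fd < 0 || written != BLOCK_SIZE || fsync(fd) != 0)
    {
        if (fd >= 0)
            close(fd);
        return -1;
    }
    return close(fd);
}
```

Note that even with O_DIRECT an fsync (or fdatasync) is still needed to cover
metadata, so it doesn't remove the ordering question, only narrows it.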

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
