Re: fsync reliability

From: Simon Riggs <simon(at)2ndQuadrant(dot)com>
To: Greg Smith <greg(at)2ndquadrant(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: fsync reliability
Date: 2011-04-22 13:32:00
Message-ID: BANLkTikvz91nLy-zQy+rE+CUi9B0EbnPZA@mail.gmail.com
Lists: pgsql-hackers

On Fri, Apr 22, 2011 at 1:35 PM, Greg Smith <greg(at)2ndquadrant(dot)com> wrote:
> Simon Riggs wrote:
>>
>> We do issue fsync and then close, but only when we switch log files.
>> We don't do that as part of the normal commit path.
>>
>
> Since all these files are zero-filled before use, the space is allocated for
> them, and the remaining important filesystem layout metadata gets flushed
> during the close.  The only metadata that changes after that--things like
> the last access time--isn't important to the WAL functioning.  So the
> metadata doesn't need to be updated after a normal commit, it's already
> there.  There are two main risks when crashing while fsync is in the middle
> of executing a push out to physical storage: torn pages due to partial data
> writes, and other out of order writes.  The only filesystems where this
> isn't true are the copy on write ones, where the blocks move around on disk
> too.  But those all have their own more careful guarantees about metadata
> too.

OK, that's good, but ISTM we still have a hole during
RemoveOldXlogFiles() where we don't fsync or open/close the file, just
rename it.

The WAL filename is critical in identifying the next batch of data, so
incorrect metadata will affect crash recovery.

So we are relying on the metadata being safe.
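
To make the concern concrete, one common way to stop relying on the
metadata is to fsync the containing directory after the rename. The sketch
below is an illustration only and is not what RemoveOldXlogFiles() does: the
function name sketch_rename_and_sync_dir and its arguments are hypothetical,
and it assumes a platform (such as Linux) where fsync on a directory opened
read-only is permitted.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: rename dir/old_name to dir/new_name, then fsync the
 * directory so the new directory entry itself reaches stable storage. */
int sketch_rename_and_sync_dir(const char *dir,
                               const char *old_name,
                               const char *new_name)
{
    char old_path[1024];
    char new_path[1024];

    /* Refuse paths that would not fit rather than truncate them. */
    int n = snprintf(old_path, sizeof old_path, "%s/%s", dir, old_name);
    if (n < 0 || n >= (int) sizeof old_path)
        return -1;
    n = snprintf(new_path, sizeof new_path, "%s/%s", dir, new_name);
    if (n < 0 || n >= (int) sizeof new_path)
        return -1;

    if (rename(old_path, new_path) != 0)
        return -1;

    /* The rename only changed the directory, so that is what gets synced. */
    int dfd = open(dir, O_RDONLY);
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);
    close(dfd);
    return rc;
}
```

Without the directory fsync, whether the new name survives a crash depends on
the filesystem's own metadata guarantees.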

>> The issue you raise above where "fsync is not safe for Write Ahead
>> Logging" doesn't sound good. I don't think what you've said has fully
>> addressed that yet. We could replace the commit path with O_DIRECT and
>> physically order the data blocks, but I would guess the code path to
>> durable storage has way too many bits of code tweaking it for me to
>> feel happy that was worth it.
>>
>
> As far as I can tell the CRC is sufficient protection against that.  This is
> all data that hasn't really been committed being torn up here.  Once you
> trust that the metadata problem isn't real, reordered writes are only
> going to destroy things that are in the middle of being flushed to disk.  A
> synchronous commit mangled this way will be rolled back regardless because
> it never really finished (fsync didn't return); an asynchronous one was
> never guaranteed to be on disk.

OK, that's clear. Thanks for putting my mind at rest.
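
As a rough illustration of the CRC argument quoted above, a record whose
tail never reached disk fails its checksum on replay and is treated as the
end of the log. This sketch uses the common reflected CRC-32 and an invented
record layout (SketchRecord); neither is claimed to match PostgreSQL's actual
WAL record format or CRC variant, and all names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_PAYLOAD_MAX 64   /* arbitrary limit for the illustration */

/* Invented record layout: length, checksum, then payload bytes. */
typedef struct {
    uint32_t len;
    uint32_t crc;
    unsigned char payload[SKETCH_PAYLOAD_MAX];
} SketchRecord;

/* Bitwise reflected CRC-32 (polynomial 0xEDB88320). */
uint32_t sketch_crc32(const unsigned char *buf, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Build a record and stamp it with the CRC of its payload. */
int sketch_record_fill(SketchRecord *rec, const void *data, uint32_t len)
{
    if (len == 0 || len > SKETCH_PAYLOAD_MAX)
        return -1;
    memset(rec, 0, sizeof *rec);
    rec->len = len;
    memcpy(rec->payload, data, len);
    rec->crc = sketch_crc32(rec->payload, len);
    return 0;
}

/* Replay-side check: an all-zero slot (never written) or a payload that
 * does not match its CRC (torn or reordered write) is rejected. */
int sketch_record_is_valid(const SketchRecord *rec)
{
    if (rec->len == 0 || rec->len > SKETCH_PAYLOAD_MAX)
        return 0;
    return sketch_crc32(rec->payload, rec->len) == rec->crc;
}
```

The explicit zero-length check matters because the segment is zero-filled
before use: an untouched slot must read as "no record here" rather than as a
valid empty record.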

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
