Re: silent data loss with ext4 / all current versions

From: Magnus Hagander <magnus(at)hagander(dot)net>
To: Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
Cc: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL mailing lists <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: silent data loss with ext4 / all current versions
Date: 2016-01-22 08:41:00
Message-ID: CABUevEz2XwaBJz4=XN15UfHQp1TBT9sfy1At-E7DheAEf6e5hw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, Jan 22, 2016 at 9:26 AM, Tomas Vondra <tomas(dot)vondra(at)2ndquadrant(dot)com>
wrote:

> Hi,
>
> On 01/22/2016 06:45 AM, Michael Paquier wrote:
>
> So, I have been playing with a Linux VM with VMware Fusion and on
>> ext4 with data=ordered the renames are getting lost if the root
>> folder is not fsync. By killing-9 the VM I am able to reproduce that
>> really easily.
>>
>
> Yep. Same experience here (with qemu-kvm VMs).
>
> Here are some comments about your patch after a look at the code.
>>
>> Regarding the additions in fsync_fname() in xlog.c:
>> 1) In InstallXLogFileSegment, rename() will be called only if
>> HAVE_WORKING_LINK is not used, which happens only on Windows and
>> cygwin. We could add it for consistency, but it should be within the
>> #else/#endif block. It is not critical as of now.
>> 2) The call in RemoveXlogFile is not necessary, the rename happening
>> only on Windows.
>>
>
> Hmmm, OK. Are we sure HAVE_WORKING_LINK is false only on Windows, or could
> there be some other platforms? And are we sure the file systems on those
> platforms are safe without the fsync call?
>
> That is, while the report references ext4, there may be other file systems
> with the same problem - ext4 was used mostly as it's the most widely used
> Linux file system.
>
> 3) In exitArchiveRecovery if the rename is not made durable I think
>> it does not matter much. Even if recovery.conf is the one present
>> once the node restarts node would move back again to recovery, and
>> actually we had better move back to recovery in this case, no?
>>
>
> I'm strongly against this "optimization" - I'm more than happy to exchange
> the one fsync for not having to manually fix the database after crash.
>
> I don't really see why switching back to recovery should be desirable in
> this case? Imagine you have a warm/hot standby, and that you promote it to
> master. The clients connect, start issuing commands and then the system
> crashes and loses the recovery.conf rename. The system reboots, database
> performs local recovery but then starts as a standby and starts rejecting
> writes. That seems really weird to me.
>
> 4) StartupXLOG for the tablespace map. I don't think this one is
>> needed as well. Even if the tablespace map is not removed after a
>> power loss user would get an error telling that the file should be
>> removed.
>>
>
> Please no, for the same reasons as in (3).
>
> 5) For the one where haveBackupLabel || haveTblspcMap. If we do the
>> fsync, we guarantee that there is no need to do again the recovery.
>> But in case of a power loss, isn't it better to do the recovery again?
>>
>
> Why would it be better? Why should we do something twice when we don't
> have to? Had this not be reliable, then the whole recovery process is
> fundamentally broken and we better fix it instead of merely putting a
> band-aid on it.
>
> 6) For the one after XLogArchiveNotify() for the last partial
>> segment of the old timeline, it doesn't really matter to not make the
>> change persistent as this is mainly done because it is useful to
>> identify that it is a partial segment.
>>
>
> OK, although I still don't quite see why that should be a reason not to do
> the fsync. It's not really going to give us any measurable performance
> advantage (how often we do those fsyncs), so I'd vote to keep it and make
> sure the partial segments are named accordingly. Less confusion is always
> better.
>
> 7) In CancelBackup, this one is not needed as well I think. We would
>> surely want to get back to recovery if those files remain after a
>> power loss.
>>
>
> I may be missing something, but why would we switch to recovery in this
> case?
>
>
>> For the ones in xlogarchive.c:
>> 1) For KeepFileRestoredFromArchive, it does not matter here, we are
>> renaming a file with a temporary name to a permanent name. Once the
>> node restarts, we would see an extra temporary file if the rename
>> was not effective.
>>
>
> So we'll lose the segment (won't have it locally under the permanent
> name), as we've already restored it and won't do that again. Is that really
> a good thing to do?
>
> 2) XLogArchiveForceDone, the only bad thing that would happen here is
>> to leave behind a .ready file instead of a .done file. I guess that we
>> could have it though as an optimization to not have to archive again
>> this file.
>>
>
> Yes.
>
>
>> For the one in pgarch.c:
>> 1) In pgarch_archiveDone, we could add it as an optimization to
>> actually let the server know that the segment has been already
>> archived, preventing a retry.
>>
>
> Not sure what you mean by "could add it as an optimization"?
>
> In timeline.c:
>> 1) writeTimeLineHistoryFile, it does not matter much. In the worst
>> case we would have just a dummy temporary file, and startup process
>> would come back here (in exitArchiveRecovery() we may finish with an
>> unnamed backup file similarly).
>>
>
> OK
>
> 2) In writeTimeLineHistoryFile, similarly we don't need to care
>> much, in WalRcvFetchTimeLineHistoryFiles recovery would take again
>> the same path
>>
>
> OK
>
>
>> Thoughts?
>>
>>
> Thanks for the review and comments. I think the question is whether we
> only want to do the additional fsync() only when it ultimately may lead to
> data loss, or even in cases where it may cause operational issues (e.g.
> switching back to recovery needlessly).
>
> I'd vote for the latter, as I think it makes the database easier to
> operate (less manual interventions) and the performance impact is 0 (as
> those fsyncs are really rare).
>

Yeah, unless it gives a significant performance penalty, I'd agree that the
latter seems like the better option of those. +1 for that way :)

--
Magnus Hagander
Me: http://www.hagander.net/
Work: http://www.redpill-linpro.com/

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Aleksander Alekseev 2016-01-22 10:14:32 Re: Patch: fix lock contention for HASHHDR.mutex
Previous Message Vik Fearing 2016-01-22 08:28:56 Re: Extracting fields from 'infinity'::TIMESTAMP[TZ]