Re: Pre-allocating WAL files

From: Nathan Bossart <nathandbossart(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Andy Fan <zhihuifan1213(at)163(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Maxim Orlov <orlovmg(at)gmail(dot)com>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Maxim Orlov <m(dot)orlov(at)postgrespro(dot)ru>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Pre-allocating WAL files
Date: 2025-01-23 18:21:12
Message-ID: Z5KImHaKdpqtYqb4@nathan
Lists: pgsql-hackers

On Wed, Jan 22, 2025 at 01:00:08PM -0500, Andres Freund wrote:
> On 2025-01-22 11:43:20 -0600, Nathan Bossart wrote:
>> What is the purpose of syncing the file before the rename?
>
> It's from the general durable_rename() code. The reason it's there is that
> it's required for the "atomically replace a file" use case. Imagine the following:
>
> create_and_fill("somefile.tmp");
> rename("somefile.tmp", "somefile");
> fsync("somefile");
> fsync(".");
>
> If you crash (OS/HW level) at the wrong moment (between rename() taking effect
> in-memory and the fsyncs), you might end up with "somefile" pointing to the
> *new* file, because the rename took effect, but with the new file's content
> not having reached disk yet. I.e. "somefile" will be empty. Whether that's
> possible depends on filesystem semantics (e.g. on ext4 it's possible with
> data=writeback, I think it's always possible on xfs).
>
> In contrast to that, if you fsync("somefile.tmp") before the rename, a crash
> between rename() and the later fsyncs will have "somefile" either pointing to
> the *old and valid contents* or the *new and valid contents*, without a chance
> for an empty file.

Got it, thanks for explaining. If the contents are sync'd before the
rename(), do we still need to fsync() it again afterwards, too? I'd expect
that to ordinarily not have much to do, but perhaps I'm forgetting about
some metadata that isn't covered by the fsync() on the directory.
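
For the archives, here is a minimal sketch of the sync-before-rename ordering
described in the quoted explanation, written against plain POSIX calls. The
names replace_file_durably() and fsync_path() are made up for illustration;
this is not the actual durable_rename() code, and it assumes the temporary
file and the target live in the same directory. The fsync() after the rename
is included only because the discussion above says durable_rename() does one.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: open a path, fsync it, close it. 0 on success. */
static int
fsync_path(const char *path, int flags)
{
    int fd = open(path, flags);

    if (fd < 0)
        return -1;

    int rc = fsync(fd);

    close(fd);
    return rc;
}

/*
 * Hypothetical helper: replace "final" with "data", using "tmp" as a
 * scratch file located in the same directory "dir". Returns 0 on success.
 */
static int
replace_file_durably(const char *dir, const char *tmp, const char *final,
                     const char *data, size_t len)
{
    assert(dir != NULL && tmp != NULL && final != NULL && data != NULL);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;

    /* Write the full contents, coping with short writes. */
    size_t off = 0;

    while (off < len)
    {
        ssize_t n = write(fd, data + off, len - off);

        if (n < 0)
        {
            close(fd);
            return -1;
        }
        off += (size_t) n;
    }

    /* 1. Flush the new contents BEFORE the rename can expose them. */
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* 2. Switch the name over to the new file. */
    if (rename(tmp, final) != 0)
        return -1;

    /* 3. fsync the file again under its new name (usually little to do). */
    if (fsync_path(final, O_RDWR) != 0)
        return -1;

    /* 4. fsync the containing directory so the rename itself is durable. */
    return fsync_path(dir, O_RDONLY);
}
```

A caller would invoke it as replace_file_durably(".", "somefile.tmp",
"somefile", buf, len). With step 1 in place, a crash after step 2 should
leave "somefile" with either the old or the new contents, never an empty file.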

> However, for the case of WAL recycling, we shouldn't need fsync() before the
> rename, because we ought to already have done so when creating it
> (cf. XLogFileInitInternal()) or when recycling it last time.

Makes sense.

> I suspect the theoretically superfluous fsync() won't have a meaningful
> performance impact most of the time though, because
>
> a) There shouldn't be any dirty data for the file; obviously we need to have
> flushed the WAL past the recycled segment
>
> b) Except for the first to-be-recycled segment, we just fsynced after the last
> rename, so there won't be any filesystem journal data that needs to be
> flushed
>
> I'm not entirely sure about a) though - depending on mount options it's
> possible that the fsync() will flush file modification times when using
> wal_sync_method=fdatasync. But even if that's possibly reachable, I doubt
> it'll be common, due to a checkpoint having to complete between the WAL flush
> and recycling. Could be worth experimenting with.

Yeah, I'm not too worried about the performance impact of some superfluous
fsync() calls, either, but I wasn't sure I properly understood the fsync()
pattern in durable_rename() (and figured it'd be nice to get it documented
in the archives).

--
nathan
