Re: Pre-allocating WAL files

From: Andres Freund <andres(at)anarazel(dot)de>
To: Nathan Bossart <nathandbossart(at)gmail(dot)com>
Cc: Andy Fan <zhihuifan1213(at)163(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Maxim Orlov <orlovmg(at)gmail(dot)com>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Maxim Orlov <m(dot)orlov(at)postgrespro(dot)ru>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Pre-allocating WAL files
Date: 2025-01-22 18:00:08
Message-ID: zwsab4baoouacwre6h63qzeiaebaaruv6etujlqxnt56knmlpv@zo654cugfyje
Lists: pgsql-hackers

Hi,

On 2025-01-22 11:43:20 -0600, Nathan Bossart wrote:
> On Wed, Jan 22, 2025 at 11:21:03AM -0500, Andres Freund wrote:
> > fsync(open(oldname1));
> > fsync(open(oldname2));
> > ..
> > fsync(open(oldnameN));
> >
> > rename(oldname1, newname1);
> > rename(oldname2, newname2);
> > ..
> > rename(oldnameN, newnameN);
> >
> > fsync(open(newname1));
> > fsync(open(newname2));
> > ..
> > fsync(open(newnameN));
> >
> > fsync(open("pg_wal"));
>
> What is the purpose of syncing the file before the rename?

It's from the general durable_rename() code. The reason it's there is that it's
required for the "atomically replace a file" use case. Imagine the following:

create_and_fill("somefile.tmp");
rename("somefile.tmp", "somefile");
fsync("somefile");
fsync(".");

If you crash (OS/HW level) at the wrong moment (between rename() taking effect
in memory and the fsyncs), you might end up with "somefile" pointing to the
*new* file, because the rename took effect, but the new file's content not
having reached disk yet. I.e. "somefile" will be empty. Whether that's
possible depends on filesystem semantics (e.g. on ext4 it's possible with
data=writeback, I think it's always possible on xfs).

In contrast to that, if you fsync("somefile.tmp") before the rename, a crash
between rename() and the later fsyncs will have "somefile" either pointing to
the *old and valid contents* or the *new and valid contents*, without a chance
for an empty file.
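
A minimal sketch of that safe ordering in plain POSIX C, for illustration
only. This is not PostgreSQL's durable_rename(); the helper names
fsync_path() and replace_file_durably() are made up, and error handling is
reduced to returning -1.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * fsync a file or directory by path. Opening read-only is enough on Linux;
 * that this works on every platform is an assumption of this sketch.
 */
static int
fsync_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    int rc;

    if (fd < 0)
        return -1;
    rc = fsync(fd);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}

/*
 * Hypothetical helper: write "data" to "tmp", then atomically replace "dst"
 * with it. "dir" is the directory containing both names.
 */
static int
replace_file_durably(const char *tmp, const char *dst, const char *dir,
                     const void *data, size_t len)
{
    const char *p = data;
    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;

    while (len > 0)
    {
        ssize_t n = write(fd, p, len);

        if (n <= 0)
        {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t) n;
    }

    /* 1. Flush the new contents while they are still under the old name. */
    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    if (close(fd) != 0)
        return -1;

    /* 2. Switch names; a crash after this point sees old or new contents. */
    if (rename(tmp, dst) != 0)
        return -1;

    /* 3. Flush the file under its new name, then the directory entry. */
    if (fsync_path(dst) != 0)
        return -1;
    return fsync_path(dir);
}
```

The fsync() in step 1 is the one the question is about: without it, the
rename can become durable before the data does.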

However, for the case of WAL recycling, we shouldn't need fsync() before the
rename, because we ought to already have done so when creating it
(cf. XLogFileInitInternal()) or when recycling it last time.

I suspect the theoretically superfluous fsync() won't have a meaningful
performance impact most of the time though, because

a) There shouldn't be any dirty data for the file; obviously we need to have
flushed the WAL past the recycled segment

b) Except for the first to-be-recycled segment, we just fsynced after the last
rename, so there won't be any filesystem journal data that needs to be
flushed

I'm not entirely sure about a) though - depending on mount options it's
possible that the fsync() will flush file modification times when using
wal_sync_method=fdatasync. But even if that's possibly reachable, I doubt
it'll be common, due to a checkpoint having to complete between the WAL flush
and recycling. Could be worth experimenting with.

Greetings,

Andres
