
Re: Pre-allocating WAL files

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Andy Fan <zhihuifan1213(at)163(dot)com>
Cc:	Nathan Bossart <nathandbossart(at)gmail(dot)com>, Justin Pryzby <pryzby(at)telsasoft(dot)com>, Maxim Orlov <orlovmg(at)gmail(dot)com>, Pavel Borisov <pashkin(dot)elfe(at)gmail(dot)com>, "Bossart, Nathan" <bossartn(at)amazon(dot)com>, Maxim Orlov <m(dot)orlov(at)postgrespro(dot)ru>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: Pre-allocating WAL files
Date:	2025-01-22 16:21:03
Message-ID:	745zvagaf6tfn2zbvtmqvxsr6kmybjbzcw6rsm55jnql4233je@tudeipoewruo
Lists:	pgsql-hackers

Hi,

On 2025-01-22 01:14:22 +0000, Andy Fan wrote:
> Andres Freund <andres(at)anarazel(dot)de> writes:
> > FWIW, I've seen the fsyncs around recycling being a rather substantial
> > bottleneck. To the point of the main benefit of larger segments being the
> > reduction in number of fsyncs at the end of a checkpoint. I think we should
> > be able to make the fsyncs a lot more efficient by batching them, first rename
> > a bunch of files, then fsync them and the directory. The current pattern
> > basically requires a separate filesystem journal flush for each WAL segment.
>
> For educational purposes, how does one fsync files in a batch? 'man fsync'
> tells me that a user can only fsync one file at a time.
>
> int fsync(int fd);
>
> The fsync manual does not seem to say that fsync on a directory would fsync
> all the files under that directory.

Right now we do something that essentially boils down to

// recycle WAL file oldname1
fsync(open(oldname1));
rename(oldname1, newname1);
fsync(open(newname1));
fsync(open("pg_wal"));

// recycle WAL file oldname2
fsync(open(oldname2));
rename(oldname2, newname2);
fsync(open(newname2));
fsync(open("pg_wal"));
...

// recycle WAL file oldnameN
fsync(open(oldnameN));
rename(oldnameN, newnameN);
fsync(open(newnameN));
fsync(open("pg_wal"));
...

Most of the time the fsync on oldname won't have to do any IO (because
presumably we'll have flushed it before), but the rename obviously requires a
metadata update, so one of the fsyncs after it will have work to do (whether
that is the fsync on newname or on the directory differs between filesystems).

This pattern basically forces the filesystem to do at least one journal flush
for every single WAL segment. I.e. each recycled segment will have at least
the latency of a single synchronous durable write IO.

But if we instead change it to something like this:

fsync(open(oldname1));
fsync(open(oldname2));
..
fsync(open(oldnameN));

rename(oldname1, newname1);
rename(oldname2, newname2);
..
rename(oldnameN, newnameN);

fsync(open(newname1));
fsync(open(newname2));
..
fsync(open(newnameN));

fsync(open("pg_wal"));

Most filesystems will be able to combine many of the journal flushes
triggered by the renames into much bigger journal flushes. That means the
overall time for recycling is much lower than with the earlier pattern, since
there are far fewer synchronous durable writes.
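
As a rough C sketch of the batched ordering (not the actual PostgreSQL code:
fsync_path and recycle_batch are hypothetical names, error handling is reduced
to returning -1, and it assumes the platform allows fsync on a read-only
descriptor, as Linux does):

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: open a file or directory, fsync it, close it. */
int fsync_path(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = fsync(fd);
    close(fd);
    return rc;
}

/*
 * Hypothetical batched recycle: flush all old names, do all renames,
 * flush all new names, then flush the containing directory once.
 * Returns 0 on success, -1 on the first failure.
 */
int recycle_batch(const char *dir, const char *oldnames[],
                  const char *newnames[], int n)
{
    for (int i = 0; i < n; i++)
        if (fsync_path(oldnames[i]) != 0)
            return -1;

    for (int i = 0; i < n; i++)
        if (rename(oldnames[i], newnames[i]) != 0)
            return -1;

    for (int i = 0; i < n; i++)
        if (fsync_path(newnames[i]) != 0)
            return -1;

    /* A single directory fsync covers all of the renames above. */
    return fsync_path(dir);
}
```

The point of the ordering is that no fsync sits between two renames, so the
filesystem is free to commit their metadata updates together.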

Here's a rough approximation of the effect using shell commands:

andres@awork3:/srv/dev/renamet$ rm -f test.*; N=1000; time (for i in $(seq 1 $N); do echo test > test.$i.old; done;sync; for i in $(seq 1 $N); do mv test.$i.old test.$i.new; sync; done;)

real 0m7.218s
user 0m0.431s
sys 0m4.892s

andres@awork3:/srv/dev/renamet$ rm -f test.*; N=1000; time (for i in $(seq 1 $N); do echo test > test.$i.old; done;sync; for i in $(seq 1 $N); do mv test.$i.old test.$i.new; done; sync)

real 0m2.678s
user 0m0.282s
sys 0m2.402s

The only difference between the two versions is that the latter can combine
the journal flushes, due to the sync happening outside of the loop.

This is a somewhat poor approximation of how this would work in postgres,
including likely exaggerating the gain (I think sync flushes the filesystem
superblock too), but it does show the principle.
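
A somewhat closer approximation could call fsync() on the renamed files and on
the directory instead of running sync. The following is only a sketch under
that assumption: sync_one and time_renames are made-up names, the file naming
mirrors the shell test, and no timings are claimed for it here.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: open a file or directory, fsync it, close it. */
int sync_one(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int rc = fsync(fd);
    close(fd);
    return rc;
}

/*
 * Create n small files in dir, then rename test.<i>.old to test.<i>.new.
 * batched == 0: fsync the new name and the directory after every rename.
 * batched != 0: do all renames first, then all fsyncs, then one directory
 * fsync. Returns the elapsed seconds of the rename phase, or -1.0 on error.
 */
double time_renames(const char *dir, int n, int batched)
{
    char oldp[4096], newp[4096];
    struct timespec t0, t1;

    for (int i = 0; i < n; i++)
    {
        snprintf(oldp, sizeof(oldp), "%s/test.%d.old", dir, i);
        FILE *f = fopen(oldp, "w");
        if (f == NULL)
            return -1.0;
        fputs("test\n", f);
        if (fclose(f) != 0 || sync_one(oldp) != 0)
            return -1.0;
    }
    if (sync_one(dir) != 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < n; i++)
    {
        snprintf(oldp, sizeof(oldp), "%s/test.%d.old", dir, i);
        snprintf(newp, sizeof(newp), "%s/test.%d.new", dir, i);
        if (rename(oldp, newp) != 0)
            return -1.0;
        if (!batched && (sync_one(newp) != 0 || sync_one(dir) != 0))
            return -1.0;
    }
    if (batched)
    {
        for (int i = 0; i < n; i++)
        {
            snprintf(newp, sizeof(newp), "%s/test.%d.new", dir, i);
            if (sync_one(newp) != 0)
                return -1.0;
        }
        if (sync_one(dir) != 0)
            return -1.0;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double) (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling time_renames(dir, 1000, 0) and then time_renames(dir, 1000, 1) on the
filesystem of interest would give the two numbers to compare.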

Greetings,

Andres Freund

In response to

Re: Pre-allocating WAL files at 2025-01-22 01:14:22 from Andy Fan

Responses

Re: Pre-allocating WAL files at 2025-01-22 17:43:20 from Nathan Bossart
