Re: Purpose of wal_init_zero

From: Andres Freund <andres(at)anarazel(dot)de>
To: Ritu Bhandari <mailritubhandari(at)gmail(dot)com>
Cc: Andy Fan <zhihuifan1213(at)163(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Purpose of wal_init_zero
Date: 2025-01-17 21:29:14
Message-ID: pb7zxi7gtcxn5xsj2qbxdgcg5eqkn7y45gfcyu4g6c5tp4fcbz@by53xtagmqwj
Lists: pgsql-hackers

Hi,

On 2025-01-16 14:50:57 +0530, Ritu Bhandari wrote:
> Adding to Andy Fan's point above:
>
> If we increase WAL segment size from 16MB to 64MB, initializing the 64MB
> WAL segment inline can cause several seconds of freeze on all write
> transactions when it happens. Writing out a newly zero-filled 64MB WAL
> segment takes several seconds for smaller disk sizes.
>
> Disk size (GB)   Throughput per GiB (MiBps)   Throughput (MiBps)   Time to write 64MB (s)
> 10               0.48                         5                    13.33
> 32               0.48                         15                   4.17
> 64               0.48                         31                   2.08
> 128              0.48                         61                   1.04
> 256              0.48                         123                  0.52
> 500              0.48                         240                  0.27
> 834              0.48                         400                  0.16
> 1,000            0.48                         480                  0.13
>
>
> Writing a full 64MB of zeroes at every WAL file switch will not just cause
> general performance degradation, but more concerningly also makes the
> workload more "jittery": all WAL writes, and therefore all write workloads,
> stall at every WAL switch for the time it takes to zero-fill.
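The quoted time-to-write figures follow directly from dividing the segment size by the disk's throughput (0.48 MiBps per provisioned GiB, per the table above). A minimal sketch of that arithmetic, with purely illustrative names:

```c
/* Illustrative helper (not PostgreSQL code): seconds needed to write one
 * zero-filled WAL segment, assuming the table's throughput model of
 * 0.48 MiBps per GiB of provisioned disk. */
static double
seconds_to_zero_fill(double segment_mib, double disk_gb)
{
    double throughput_mibps = 0.48 * disk_gb;

    return segment_mib / throughput_mibps;
}
```

E.g. a 64MB segment on a 10GB disk works out to roughly 13.3 seconds, matching the first table row.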

I agree. But I don't think a ~2x performance reduction in common cases is
going to be an acceptable price for disabling WAL init by default.

I think what we instead ought to do is to more aggressively initialize WAL
files ahead of time, so it doesn't happen while holding crucial locks. We
know the recent rate of WAL generation, and we could easily track up to which
LSN we have recycled WAL segments. Armed with that information walwriter (or
something else) should try to ensure that there's always a fair amount of
pre-allocated WAL.
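As a rough sketch of the kind of heuristic walwriter (or something else) could apply, given the recent WAL generation rate and the LSN up to which segments have been recycled; the names and the lead-time policy here are purely illustrative, not existing PostgreSQL code:

```c
#include <stdint.h>

/* Hypothetical helper: how many segments should be kept pre-initialized
 * ahead of the current insert position, given the recent WAL generation
 * rate (bytes/sec), a desired lead time, and the segment size. */
static int
segments_to_preallocate(uint64_t wal_bytes_per_sec,
                        uint32_t lead_seconds,
                        uint64_t wal_segment_size)
{
    uint64_t bytes_ahead = wal_bytes_per_sec * lead_seconds;
    /* round up to whole segments */
    uint64_t segs = (bytes_ahead + wal_segment_size - 1) / wal_segment_size;

    /* always keep at least one segment ready */
    return segs < 1 ? 1 : (int) segs;
}
```

A background process could periodically compare this target against the number of already-recycled/initialized segments and top up the difference, outside any critical lock.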

If your disk only has a sequential write speed of 4.8MB/s, I don't think any
nontrivial database workload is going to work well. And it obviously makes no
sense whatsoever to increase the WAL segment size on such systems.

I don't think we really can make the smallest disks in your list work well -
there's only so much we can do given the low limits, and we can probably
invest our time much more fruitfully by focusing on systems with disk speeds
that aren't slower than spinning rust from the 1990s.

That's not to say it's not worth working on preallocating WAL files. But
that's not going to help much if initializing a single WAL segment is going to
eat the entire bandwidth budget for 10+ seconds.

> Also, about WAL recycling: during our performance benchmarking, we noticed
> that a high volume of updates or inserts tends to generate WAL faster than
> the standard checkpoint process can keep up with, resulting in increased
> WAL file creation (instead of recycling) and zero-filling, which
> significantly degrades performance.

I'm not sure I understand the specifics here - did the high WAL generation
rate result in the recycling taking too long? Or did checkpointer take too
long to write out data, and because of that recycling didn't happen frequently
enough?

> I see, PG once had fallocate [1] (which was reverted by [2] due to some
> performance regression concern). The original OSS discussion was in [3].
> The perf regression was reported in [4]. Looks like this was due to how
> ext4 handled extents and uninitialized data[5] and that seems to be fixed
> in [6]. I'll check with Theodore Ts'o to confirm on [6].
>
> Could we consider adding back fallocate?

Fallocate doesn't really help, unfortunately. On common filesystems (like
ext4/xfs) it just allocates file space without zeroing out the underlying
blocks. To make that correct, those filesystems keep a bitmap indicating which
blocks in the range have not yet been written. Unfortunately, updating that
bitmap is a metadata operation and thus requires journaling.

I've seen some mild speedups from first using fallocate and then zeroing out
the file, particularly with larger segment sizes - I think mainly due to
avoiding delayed allocation in the filesystem rather than actually reducing
fragmentation. But it really isn't a whole lot.
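That fallocate-then-zero combination can be sketched as below. This is a standalone illustrative demo under stated assumptions, not the actual XLogFileInit code, and it uses a deliberately small segment size:

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/stat.h>

/* Deliberately small for the demo; real WAL segments are 16MB or more. */
#define SEG_SIZE (1024 * 1024)

/* Sketch: reserve the file's blocks up front with posix_fallocate (avoiding
 * delayed allocation), then overwrite them with zeroes so that later WAL
 * writes are plain block overwrites rather than operations that must update
 * the filesystem's "unwritten extent" metadata. Returns 0 on success. */
static int
init_wal_segment(const char *path)
{
    char    buf[8192];
    off_t   off;
    int     fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0600);

    if (fd < 0)
        return -1;

    /* allocate all the file space in one metadata operation */
    if (posix_fallocate(fd, 0, SEG_SIZE) != 0)
    {
        close(fd);
        return -1;
    }

    /* now force every block to actually be written */
    memset(buf, 0, sizeof(buf));
    for (off = 0; off < SEG_SIZE; off += (off_t) sizeof(buf))
    {
        if (pwrite(fd, buf, sizeof(buf), off) != (ssize_t) sizeof(buf))
        {
            close(fd);
            return -1;
        }
    }

    if (fsync(fd) != 0)
    {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

The ordering matters: fallocate first keeps the extent allocation to a single metadata update, and the subsequent zero-fill converts the unwritten extents to written ones before any WAL record lands in the file.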

I've in the past tried to get the Linux filesystem developers to add an
fallocate mode that doesn't use the "unwritten extents" optimization, but
didn't have luck with that. The block layer in Linux actually does have
support for zeroing out regions of blocks without having to actually write
the data, but it's only used in some narrow cases (I don't remember the
details).

Greetings,

Andres Freund
