Re: Purpose of wal_init_zero

From: Ritu Bhandari <mailritubhandari(at)gmail(dot)com>
To: Andy Fan <zhihuifan1213(at)163(dot)com>
Cc: Andres Freund <andres(at)anarazel(dot)de>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Purpose of wal_init_zero
Date: 2025-01-16 09:20:57
Message-ID: CAPNLunXuOc_Oyrr-pRVRcjCwV-G28vC2g6P-23BjVxRoNf9vRg@mail.gmail.com
Lists: pgsql-hackers

Hi,

Adding to Andy Fan's point above:

If we increase the WAL segment size from 16MB to 64MB, initializing a 64MB
WAL segment inline can freeze all write transactions for several seconds
each time it happens. On smaller disks, writing out a newly zero-filled
64MB WAL segment takes several seconds:

Disk size (GB)  Throughput per GiB (MiBps)  Throughput (MiBps)  Time to write 64MB (seconds)
10              0.48                        5                   13.33
32              0.48                        15                  4.17
64              0.48                        31                  2.08
128             0.48                        61                  1.04
256             0.48                        123                 0.52
500             0.48                        240                 0.27
834             0.48                        400                 0.16
1,000           0.48                        480                 0.13
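
For anyone who wants to redo the arithmetic for another disk size, here is a
small Python sketch. It assumes throughput scales linearly at 0.48 MiBps per
unit of disk size, as in the table, and that the zero-fill runs at exactly
that rate; the function and constant names are made up for illustration. The
times in the table appear to be computed from the unrounded throughput
(4.8 MiBps for the 10 GB disk, not 5).

```python
SEGMENT_MIB = 64        # WAL segment size under discussion
MIBPS_PER_GIB = 0.48    # throughput per unit of disk size, taken from the table


def zero_fill_seconds(disk_size, segment_mib=SEGMENT_MIB,
                      mibps_per_gib=MIBPS_PER_GIB):
    """Seconds needed to write one zero-filled segment at the disk's throughput."""
    throughput_mibps = disk_size * mibps_per_gib
    return segment_mib / throughput_mibps


# Recompute the last column of the table for the listed disk sizes.
for size in (10, 32, 64, 128, 256, 500, 834, 1000):
    print(f"{size:>5} {zero_fill_seconds(size):6.2f}")
```

Passing segment_mib=16 gives the corresponding numbers for the default
segment size, which are a quarter of the 64MB ones.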

Writing 64MB of zeroes at every WAL file switch not only degrades
performance in general but, more concerningly, also makes the workload
"jittery": it stops all WAL writes, and therefore all write workloads, at
every WAL switch for as long as the zero-fill takes.

Also, regarding WAL recycling: during our performance benchmarking we
noticed that a high volume of updates or inserts tends to generate WAL
faster than the standard checkpoint process can keep up with. That results
in more WAL file creation (instead of recycling) and more zero-filling,
which significantly degrades performance.

I see that PG once used fallocate [1], which was reverted in [2] because of
a performance regression concern. The original discussion is in [3], and
the regression was reported in [4]. It looks like this was due to how ext4
handled extents and uninitialized data [5], which seems to have been fixed
in [6]. I'll check with Theodore Ts'o to confirm [6].

Could we consider adding back fallocate?
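
To make the two approaches concrete, here is a minimal Python sketch. It is
not PostgreSQL's C code, and the helper names are hypothetical. The first
helper writes zeroes for the whole file and fsyncs it, similar in spirit to
zero-filling a new segment; the second asks the filesystem to reserve the
space with os.posix_fallocate, which is only available on Unix-like systems
and may fail on filesystems that do not support it.

```python
import os
import tempfile


def create_zero_filled(path, size, chunk=8192):
    """Create a file of `size` bytes by writing zeroes, then fsync it."""
    zeros = bytes(chunk)
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        remaining = size
        while remaining > 0:
            written = os.write(fd, zeros[:min(chunk, remaining)])
            remaining -= written
        os.fsync(fd)
    finally:
        os.close(fd)


def create_fallocated(path, size):
    """Create a file of `size` bytes by reserving space, without writing data."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)
        os.fsync(fd)
    finally:
        os.close(fd)
```

Both helpers end with a file of the requested size. The difference is that
the first one pushes every byte through the disk's write path, which is the
cost shown in the table above, while the second one leaves it to the
filesystem to mark the space as allocated.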

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=269e780
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=5b571bb
[3] https://www.postgresql.org/message-id/flat/CAKuK5J0raLwOiKfSh5d8SxtCY2snJAMsfo6RGTBMfcQYB%2B-faQ%40mail.gmail.com
[4] https://www.postgresql.org/message-id/flat/CAA-aLv7tYHDzMGg4HtDZh0RQZjJc2v2weJ-Obm4yvkw6ePe9Qw%40mail.gmail.com
[5] https://www.postgresql.org/message-id/CAKuK5J3R-oBh%2B9f23Ko-0-gt5Zi1REgg7ng-awQuUsgiY2B7GQ%40mail.gmail.com
[6] https://github.com/torvalds/linux/commit/b71fc079b5d8f42b2a52743c8d2f1d35d655b1c5

Thanks,
-Ritu

On Thu, 16 Jan 2025 at 12:01, Andy Fan <zhihuifan1213(at)163(dot)com> wrote:

>
> Hi,
>
> >
> > c=1 && \
> > psql -c checkpoint -c 'select pg_switch_wal()' && \
> > pgbench -n -M prepared -c$c -j$c -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 8192));";) -P1 -t 10000
> >
> > wal_init_zero = 1: 885 TPS
> > wal_init_zero = 0: 286 TPS.
>
> Your theory looks clear and the result is promising. I can reproduce a
> similar result in my setup.
>
> on: tps = 1588.538378 (without initial connection time)
> off: tps = 857.755343 (without initial connection time)
>
> > Of course I chose this case to be intentionally extreme - each transaction
> > fills a bit more than one page of WAL and immediately flushes it. That
> > guarantees that each commit needs a separate filesystem metadata flush and a
> > flush of the data for the fdatasync() at commit.
>
> However, if I increase the clients from 1 to 64 (this may break the
> extreme case because of group commit), then we can see that wal_init_zero
> causes a noticeable regression.
>
> c=64 && \
> psql -c checkpoint -c 'select pg_switch_wal()' && \
> pgbench -n -M prepared -c$c -j$c -f <(echo "SELECT pg_logical_emit_message(true, 'test', repeat('0', 8192));";) -P1 -t 10000
>
> off:
> tps = 12135.110730 (without initial connection time)
> tps = 11964.016277 (without initial connection time)
> tps = 12078.458724 (without initial connection time)
>
> on:
> tps = 9392.374563 (without initial connection time)
> tps = 9391.916410 (without initial connection time)
> tps = 9390.503777 (without initial connection time)
>
> Now the zero-filling for wal_init_zero happens in a user backend, and
> other backends also have to wait for it; this does not look good to me.
> I find that the walwriter doesn't do much, so I'd like to try offloading
> wal_init_zero to the walwriter.
>
> About wal_recycle: IIUC, a WAL file can only be recycled during a
> checkpoint, but checkpoints don't happen often.
>
> --
> Best Regards
> Andy Fan
>
>
>
>
