Re: Get rid of WALBufMappingLock

From: Alexander Korotkov <aekorotkov(at)gmail(dot)com>
To: Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru>
Cc: Tomas Vondra <tomas(at)vondra(dot)me>, Michael Paquier <michael(at)paquier(dot)xyz>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Get rid of WALBufMappingLock
Date: 2025-03-31 18:18:30
Message-ID: CAPpHfdtOBgoUfGiw9exZPpS7e4EE7SdEBRt9VDfgYJpC2Jd5DA@mail.gmail.com
Lists: pgsql-hackers

On Mon, Mar 31, 2025 at 1:42 PM Yura Sokolov <y(dot)sokolov(at)postgrespro(dot)ru> wrote:
> 14.03.2025 17:30, Tomas Vondra wrote:
> > Hi,
> >
> > I've briefly looked at this patch this week, and done a bit of testing.
> > I don't have any comments about the correctness - it does seem correct
> > to me and I haven't noticed any crashes/issues, but I'm not familiar
> > enough with the WALBufMappingLock to have insightful opinions.
> >
> > I have, however, decided to do a bit of benchmarking, to better
> > understand the possible benefits of the change. I happen to have access
> > to an Azure machine with 2x AMD EPYC 9V33X (176 cores in total) and an
> > NVMe SSD that can do ~1.5GB/s.
> >
> > The benchmark script (attached) uses the workload mentioned by Andres
> > some time ago [1]
> >
> > SELECT pg_logical_emit_message(true, 'test', repeat('0', $SIZE));
> >
> > with clients (1..196) and sizes 8K, 64K and 1024K. The aggregated
> > results look like this (this is throughput):
> >
> >          |        8        |        64       |       1024
> >  clients | master  patched | master  patched | master  patched
> > --------------------------------------------------------------
> >        1 |  11864    12035 |   7419     7345 |    968      940
> >        4 |  26311    26919 |  12414    12308 |   1304     1293
> >        8 |  38742    39651 |  14316    14539 |   1348     1348
> >       16 |  57299    59917 |  15405    15871 |   1304     1279
> >       32 |  74857    82598 |  17589    17126 |   1233     1233
> >       48 |  87596    95495 |  18616    18160 |   1199     1227
> >       64 |  89982    97715 |  19033    18910 |   1196     1221
> >       96 |  92853   103448 |  19694    19706 |   1190     1210
> >      128 |  95392   103324 |  20085    19873 |   1188     1213
> >      160 |  94933   102236 |  20227    20323 |   1180     1214
> >      196 |  95933   103341 |  20448    20513 |   1188     1199
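> >
> > (For reference, a minimal pgbench setup for this workload could look
> > roughly like the sketch below; the file name, duration and option
> > values are illustrative rather than exactly what the attached script
> > does:
> >
> >   -- emit.sql
> >   SELECT pg_logical_emit_message(true, 'test', repeat('0', :size));
> >
> >   pgbench -n -f emit.sql -D size=8192 -c $CLIENTS -j $CLIENTS -T 30 -P 1
> >
> > Here -D passes the record size into the script, -n skips vacuuming the
> > standard pgbench tables, and -P 1 produces the per-second progress
> > lines quoted below.)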
> >
> > To put this into perspective, this is the throughput relative to master:
> >
> >  clients |    8     64   1024
> > -----------------------------
> >        1 | 101%    99%    97%
> >        4 | 102%    99%    99%
> >        8 | 102%   102%   100%
> >       16 | 105%   103%    98%
> >       32 | 110%    97%   100%
> >       48 | 109%    98%   102%
> >       64 | 109%    99%   102%
> >       96 | 111%   100%   102%
> >      128 | 108%    99%   102%
> >      160 | 108%   100%   103%
> >      196 | 108%   100%   101%
> >
> > That does not seem like a huge improvement :-( Yes, there's a 1-10%
> > speedup for the small (8K) size, but for larger chunks it's a wash.
> >
> > Looking at the pgbench progress, I noticed stuff like this:
> >
> > ...
> > progress: 13.0 s, 103575.2 tps, lat 0.309 ms stddev 0.071, 0 failed
> > progress: 14.0 s, 102685.2 tps, lat 0.312 ms stddev 0.072, 0 failed
> > progress: 15.0 s, 102853.9 tps, lat 0.311 ms stddev 0.072, 0 failed
> > progress: 16.0 s, 103146.0 tps, lat 0.310 ms stddev 0.075, 0 failed
> > progress: 17.0 s, 57168.1 tps, lat 0.560 ms stddev 0.153, 0 failed
> > progress: 18.0 s, 50495.9 tps, lat 0.634 ms stddev 0.060, 0 failed
> > progress: 19.0 s, 50927.0 tps, lat 0.628 ms stddev 0.066, 0 failed
> > progress: 20.0 s, 50986.7 tps, lat 0.628 ms stddev 0.062, 0 failed
> > progress: 21.0 s, 50652.3 tps, lat 0.632 ms stddev 0.061, 0 failed
> > progress: 22.0 s, 63792.9 tps, lat 0.502 ms stddev 0.168, 0 failed
> > progress: 23.0 s, 103109.9 tps, lat 0.310 ms stddev 0.072, 0 failed
> > progress: 24.0 s, 103503.8 tps, lat 0.309 ms stddev 0.071, 0 failed
> > progress: 25.0 s, 101984.2 tps, lat 0.314 ms stddev 0.073, 0 failed
> > progress: 26.0 s, 102923.1 tps, lat 0.311 ms stddev 0.072, 0 failed
> > progress: 27.0 s, 103973.1 tps, lat 0.308 ms stddev 0.072, 0 failed
> > ...
> >
> > i.e. it fluctuates a lot. I suspected this is due to the SSD doing funny
> > things (it's a virtual SSD, I'm not sure what model is behind the
> > curtains). So I decided to try running the benchmark on tmpfs, to get
> > the storage out of the way and get the "best case" results.
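> >
> > (Concretely that just means putting the data directory on a tmpfs
> > mount, e.g. something like
> >
> >   mount -t tmpfs -o size=64G tmpfs /mnt/pgtmpfs
> >   initdb -D /mnt/pgtmpfs/data
> >
> > where the mount point and size are only an illustration.)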
> >
> > This makes the pgbench progress perfectly "smooth" (no jumps like in the
> > output above), and the comparison looks like this:
> >
> >          |        8        |        64       |       1024
> >  clients | master  patched | master  patched | master  patched
> > ---------|-----------------|-----------------|----------------
> >        1 |  32449    32032 |  19289    20344 |   3108     3081
> >        4 |  68779    69256 |  24585    29912 |   2915     3449
> >        8 |  79787   100655 |  28217    39217 |   3182     4086
> >       16 | 113024   148968 |  42969    62083 |   5134     5712
> >       32 | 125884   170678 |  44256    71183 |   4910     5447
> >       48 | 125571   166695 |  44693    76411 |   4717     5215
> >       64 | 122096   160470 |  42749    83754 |   4631     5103
> >       96 | 120170   154145 |  42696    86529 |   4556     5020
> >      128 | 119204   152977 |  40880    88163 |   4529     5047
> >      160 | 116081   152708 |  42263    88066 |   4512     5000
> >      196 | 115364   152455 |  40765    88602 |   4505     4952
> >
> > and the comparison to master:
> >
> > clients      8     64   1024
> > ----------------------------
> >       1    99%   105%    99%
> >       4   101%   122%   118%
> >       8   126%   139%   128%
> >      16   132%   144%   111%
> >      32   136%   161%   111%
> >      48   133%   171%   111%
> >      64   131%   196%   110%
> >      96   128%   203%   110%
> >     128   128%   216%   111%
> >     160   132%   208%   111%
> >     196   132%   217%   110%
> >
> > Yes, with tmpfs the impact looks much more significant. For 8K the
> > speedup is ~1.3x, for 64K it's up to ~2x, for 1M it's ~1.1x.
> >
> >
> > That being said, I wonder how big the impact is for practical workloads.
> > ISTM this workload is pretty narrow / extreme; it'd be much easier if we
> > had an example of a more realistic workload benefiting from this. Of
> > course, it may be the case that there are multiple related bottlenecks
> > and we'd need to fix all of them - in which case it'd be silly to block
> > this improvement on the grounds that it alone does not help.
> >
> > Another thought is that this is testing the "good case". Can anyone
> > think of a workload that would be made worse by the patch?
>
> I've run a similar benchmark on a system with two Xeon Gold 5220R CPUs and
> two Samsung SSD 970 PRO 1TB drives mirrored by md.
>
> Configuration changes:
> wal_sync_method = open_datasync
> full_page_writes = off
> synchronous_commit = off
> checkpoint_timeout = 1d
> max_connections = 1000
> max_wal_size = 4GB
> min_wal_size = 640MB
>
> I varied the WAL segment size (16MB and 64MB), wal_buffers (128kB, 16MB
> and 1GB) and the record size (1kB, 8kB and 64kB).
>
> (I didn't benchmark the 1MB record size, since I don't believe it is
> critical for performance.)
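>
> (For reference: the WAL segment size is normally fixed at initdb time,
> so the 64MB runs need a cluster created with something like
>
>   initdb --wal-segsize=64 -D $PGDATA
>
> while wal_buffers and the record size can simply be changed per run.)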
>
> Here are the results for a 64MB segment size and 1GB wal_buffers:
>
> +---------+---------+------------+--------------+----------+
> | recsize | clients | master_tps | nowalbuf_tps | rel_perf |
> +---------+---------+------------+--------------+----------+
> |       1 |       1 |    47991.0 |      46995.0 |     0.98 |
> |       1 |       4 |   171930.0 |     171166.0 |      1.0 |
> |       1 |      16 |   491240.0 |     485132.0 |     0.99 |
> |       1 |      64 |   514590.0 |     515534.0 |      1.0 |
> |       1 |     128 |   547222.0 |     543543.0 |     0.99 |
> |       1 |     256 |   543353.0 |     540802.0 |      1.0 |
> |       8 |       1 |    40976.0 |      41603.0 |     1.02 |
> |       8 |       4 |    89003.0 |      92008.0 |     1.03 |
> |       8 |      16 |    90457.0 |      92282.0 |     1.02 |
> |       8 |      64 |    89293.0 |      92022.0 |     1.03 |
> |       8 |     128 |    92687.0 |      92768.0 |      1.0 |
> |       8 |     256 |    91874.0 |      91665.0 |      1.0 |
> |      64 |       1 |    11829.0 |      12031.0 |     1.02 |
> |      64 |       4 |    11959.0 |      12832.0 |     1.07 |
> |      64 |      16 |    11331.0 |      13417.0 |     1.18 |
> |      64 |      64 |    11108.0 |      13588.0 |     1.22 |
> |      64 |     128 |    11089.0 |      13648.0 |     1.23 |
> |      64 |     256 |    10381.0 |      13542.0 |      1.3 |
> +---------+---------+------------+--------------+----------+
>
> Numbers for all configurations are in the attached 'improvements.out'. They
> show that removing WALBufMappingLock almost never harms performance and
> usually gives a measurable gain.
>
> (Numbers are the average of the 4 middle runs out of 6, i.e. I threw away
> the minimum and maximum tps of the 6 runs and averaged the remaining ones.)
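>
> (With the per-run tps values in a table -- say a hypothetical "runs"
> table holding the 6 runs of one configuration -- that trimmed average
> is simply
>
>   select (sum(tps) - min(tps) - max(tps)) / (count(*) - 2) from runs;
>
> i.e. drop one minimum and one maximum and average the rest.)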
>
> An SQLite database with all the results is also attached. It also contains
> results for the patch "Several attempts to lock WALInsertLock" (named
> "attempts") and for the cumulative patch ("nowalbuf-attempts").
> Surprisingly, "Several attempts" has a measurable impact in some
> configurations with hundreds of clients. So there are more bottlenecks
> ahead ))
>
>
> Yes, it is still not a "real-world" benchmark. But it at least shows the
> patch is harmless.

Thank you for your experiments. Your results show up to a 30% speedup
on real hardware, not tmpfs. While this is still a corner case, I
think that's quite a result for a pretty local optimization. At small
connection counts there are some cases both above and below 1.0; I
think this is due to statistical noise. If we calculate the average
tps ratio across the different experiments, it is still above 1.0 even
for low client counts.

sqlite> select clients, avg(ratio)
          from (select walseg, walbuf, recsize, clients,
                       (avg(tps) filter (where branch = 'nowalbuf')) /
                       (avg(tps) filter (where branch = 'master')) as ratio
                  from results
                 where branch in ('master', 'nowalbuf')
                 group by walseg, walbuf, recsize, clients) x
         group by clients;
1|1.00546614169766
4|1.00782085856889
16|1.02257892337757
64|1.04400167838906
128|1.04134006876033
256|1.04627949500578

I'm going to push the first patch ("nowalbuf") if there are no
objections. I think the second one ("Several attempts") still needs
more work, as there are regressions.
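
For anyone who wants to look at those regressions, a query along these
lines against the attached SQLite database should list the
configurations where the "attempts" branch falls behind master (the
0.98 cutoff is arbitrary, just to filter out noise):

sqlite> select walseg, walbuf, recsize, clients, ratio
          from (select walseg, walbuf, recsize, clients,
                       (avg(tps) filter (where branch = 'attempts')) /
                       (avg(tps) filter (where branch = 'master')) as ratio
                  from results
                 where branch in ('master', 'attempts')
                 group by walseg, walbuf, recsize, clients) x
         where ratio < 0.98
         order by ratio;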

------
Regards,
Alexander Korotkov
Supabase
