Re: Syncrep and improving latency due to WAL throttling

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Syncrep and improving latency due to WAL throttling
Date: 2023-11-08 06:40:08
Message-ID: 20231108064008.nh4wb5g5ucwffeil@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2023-11-04 20:00:46 +0100, Tomas Vondra wrote:
> scope
> -----
> Now, let's talk about scope - what the patch does not aim to do. The
> patch is explicitly intended for syncrep clusters, not async. There have
> been proposals to also support throttling for async replicas, logical
> replication etc. I suppose all of that could be implemented, and I do
> see the benefit of defining some sort of maximum lag even for async
> replicas. But the agreement was to focus on the syncrep case, where it's
> particularly painful, and perhaps extend it in the future.

Perhaps we should take care to make the configuration extensible in that
direction in the future?

Hm - is this feature really tied to replication at all? Pretty much the same
situation exists without it. On an ok-ish local NVMe I ran pgbench with 1 client
and -P1. Guess where I started a VACUUM (on a fully cached table, so no
continuous WAL flushes):

progress: 64.0 s, 634.0 tps, lat 1.578 ms stddev 0.477, 0 failed
progress: 65.0 s, 634.0 tps, lat 1.577 ms stddev 0.546, 0 failed
progress: 66.0 s, 639.0 tps, lat 1.566 ms stddev 0.656, 0 failed
progress: 67.0 s, 642.0 tps, lat 1.557 ms stddev 0.273, 0 failed
progress: 68.0 s, 556.0 tps, lat 1.793 ms stddev 0.690, 0 failed
progress: 69.0 s, 281.0 tps, lat 3.568 ms stddev 1.050, 0 failed
progress: 70.0 s, 282.0 tps, lat 3.539 ms stddev 1.072, 0 failed
progress: 71.0 s, 273.0 tps, lat 3.663 ms stddev 2.602, 0 failed
progress: 72.0 s, 261.0 tps, lat 3.832 ms stddev 1.889, 0 failed
progress: 73.0 s, 268.0 tps, lat 3.738 ms stddev 0.934, 0 failed

At 32 clients we go from ~10k tps to ~2.5k, with a full 2s of 0.

Subtracting pg_current_wal_flush_lsn() from pg_current_wal_insert_lsn(), the
"good times" show a delay of ~8kB (note that this includes WAL records that
are still being inserted). Once the VACUUM runs, it's ~2-3MB.

The picture with more clients is similar.
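
For reference, the same quantity is cheap to compute server-side as well,
roughly like this (just a sketch using the existing xlog.h helpers, not code
from the patch):

#include "access/xlog.h"

/*
 * Sketch: bytes of WAL inserted but not yet flushed to disk, as seen from
 * inside the server - the C-level equivalent of
 * pg_current_wal_insert_lsn() - pg_current_wal_flush_lsn().
 */
static uint64
UnflushedWALBytes(void)
{
    XLogRecPtr  insert_lsn = GetXLogInsertRecPtr();
    XLogRecPtr  flush_lsn = GetFlushRecPtr(NULL);

    return (uint64) (insert_lsn - flush_lsn);
}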

If I instead severely limit the amount of outstanding (but not the amount of
unflushed) WAL by setting wal_buffers to 128, the throughput dip is quite a bit
smaller (down to ~400 tps instead of ~260 at 1 client, and from ~10k to ~5k at
32). Of course that's ridiculous and will completely trash performance in many
other cases, but it shows that limiting the amount of outstanding WAL could
help without replication as well. With remote storage, the difference would
likely be bigger.

> problems
> --------
> Now let's talk about some problems - both conceptual and technical
> (essentially review comments for the patch).
>
> 1) The goal of the patch is to limit the impact on latency, but the
> relationship between WAL amounts and latency may not be linear. But we
> don't have a good way to predict latency, and WAL lag is the only thing
> we have, so there's that. Ultimately, it's a best effort.

It's indeed probably not linear. Realistically, to do better, we probably need
statistics for the specific system in question - the latency impact will
differ hugely between different storage/network.

> 2) The throttling is per backend. That makes it simple, but it means
> that it's hard to enforce a global lag limit. Imagine the limit is 8MB,
> and with a single backend that works fine - the lag should not exceed
> the 8MB value. But if there are N backends, the lag could be up to
> N-times 8MB, I believe. That's a bit annoying, but I guess the only
> solution would be to have some autovacuum-like cost balancing, with all
> backends (or at least those running large stuff) doing the checks more
> often. I'm not sure we want to do that.

Hm. The average case is likely fine - the different backends' throttling will
intersperse, so flushes happen more frequently - but the worst case is
presumably part of the issue here. I wonder if we could deal with this by
somehow offsetting the points at which the backends flush.

I doubt we want to go for something like autovacuum's cost balancing - that
doesn't seem to work well - but I think we could take the amount of actually unflushed
WAL into account when deciding whether to throttle. We have the necessary
state in local memory IIRC. We'd have to be careful to not throttle every
backend at the same time, or we'll introduce latency penalties that way. But
what if we scaled synchronous_commit_wal_throttle_threshold depending on the
amount of unflushed WAL? By still taking backendWalInserted into account, we'd
avoid throttling everyone at the same time, but still would make throttling
more aggressive depending on the amount of unflushed/unreplicated WAL.
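
Expressed as (very rough) code, with EffectiveThrottleThreshold() being a
made-up helper and assuming the GUC is specified in kB, the idea would be
something like:

/*
 * Made-up helper, not from the patch: scale the per-backend throttle
 * threshold down as the amount of globally unflushed WAL grows.
 */
static uint64
EffectiveThrottleThreshold(void)
{
    /* assuming the GUC is in kB */
    uint64      base = (uint64) synchronous_commit_wal_throttle_threshold * 1024;
    uint64      unflushed = GetXLogInsertRecPtr() - GetFlushRecPtr(NULL);

    /*
     * The more unflushed WAL there is, the lower the effective threshold,
     * floored at one WAL page. Since the trigger is still the per-backend
     * counter, backends won't all start throttling at the same moment.
     */
    if (base > 0 && unflushed > base)
    {
        uint64      factor = unflushed / base;  /* >= 1 */

        base = Max(base / factor, (uint64) XLOG_BLCKSZ);
    }

    return base;
}

The throttle check would then compare backendWalInserted against that, instead
of against the raw GUC value.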

> 3) The actual throttling (flush and wait for syncrep) happens in
> ProcessInterrupts(), which mostly works but it has two drawbacks:
>
> * It may not happen "early enough" if the backend inserts a lot of
> XLOG records without processing interrupts in between.

Does such code exist? And if so, is there a reason not to fix said code?

> * It may happen "too early" if the backend inserts enough WAL to need
> throttling (i.e. sets XLogDelayPending), but then after processing
> interrupts it would be busy with other stuff, not inserting more WAL.

> I think ideally we'd do the throttling right before inserting the next
> XLOG record, but there's no convenient place, I think. We'd need to
> annotate a lot of places, etc. So maybe ProcessInterrupts() is a
> reasonable approximation.

Yea, I think there's no way to do that with reasonable effort. Starting to
wait with a bunch of lwlocks held would obviously be bad.
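
To make that concrete, the flow being discussed is roughly the following (my
paraphrase of the patch, with inserted_bytes / throttle_threshold as
placeholders for its actual bookkeeping, so don't take the details literally):

/* in the WAL insertion accounting path, after all lwlocks are released */
backendWalInserted += inserted_bytes;
if (backendWalInserted >= throttle_threshold)
{
    XLogDelayPending = true;
    InterruptPending = true;    /* make the next CHECK_FOR_INTERRUPTS() fire */
}

/* in ProcessInterrupts() */
if (XLogDelayPending)
{
    XLogRecPtr  lsn = GetXLogInsertRecPtr();

    XLogDelayPending = false;
    XLogFlush(lsn);                 /* flush our own backlog ... */
    SyncRepWaitForLSN(lsn, false);  /* ... and wait for the sync standby */
    backendWalInserted = 0;
}

I.e. the potentially long wait always happens at a CHECK_FOR_INTERRUPTS(),
never with lwlocks held.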

> We may need to add CHECK_FOR_INTERRUPTS() to a couple more places, but
> that seems reasonable.

And independently beneficial.

> missing pieces
> --------------
> The thing that's missing is that some processes (like aggressive
> anti-wraparound autovacuum) should not be throttled. If people set the
> GUC in the postgresql.conf, I guess that'll affect those processes too,
> so I guess we should explicitly reset the GUC for those processes. I
> wonder if there are other cases that should not be throttled.

Hm, that's a bit hairy. If we just exempt it, we'll actually slow down everyone
else even further, even though the goal of the feature might be the opposite.
I don't think that's warranted for anti-wraparound vacuums - they're normal.
Failsafe vacuums are a different story - there we really just don't care about
impacting other backends; the goal is to keep the cluster from going read-only
in the near future.
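
In the throttle check that could be as simple as something like this (a
sketch, assuming we can just reuse the existing VacuumFailsafeActive flag from
commands/vacuum.h, which the failsafe already uses to bypass cost-based delay
IIRC):

    /*
     * Sketch: skip throttling entirely once the wraparound failsafe is
     * active - at that point finishing the vacuum matters more than the
     * latency impact on other backends.
     */
    if (VacuumFailsafeActive)
        return false;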

> tangents
> --------
> While discussing this with Andres a while ago, he mentioned a somewhat
> orthogonal idea - sending unflushed data to the replica.
>
> We currently never send unflushed data to the replica, which makes sense
> because this data is not durable and if the primary crashes/restarts,
> this data will disappear. But it also means there may be a fairly large
> chunk of WAL data that we may need to send at COMMIT and wait for the
> confirmation.
>
> He suggested we might actually send the data to the replica, but the
> replica would know this data is not flushed yet and so would not do the
> recovery etc. And at commit we could just send a request to flush,
> without having to transfer the data at that moment.
>
> I don't have a very good intuition about how large the effect would be,
> i.e. how much unflushed WAL data could accumulate on the primary
> (kilobytes/megabytes?),

That obviously depends heavily on the workload. If you have anything with bulk
writes, it can be many megabytes.

> and how big is the difference between sending a couple kilobytes or just a
> request to flush.

That obviously depends heavily on the network...

I used netperf's tcp_rr between my workstation and my laptop on a local 10Gbit
network (albeit with a crappy external card for my laptop), to put some
numbers to this. I used -r $s,100 to test sending variable-sized data to the
other side, with the other side always responding with 100 bytes (assuming
that'd more than fit a feedback response).

Command:
fields="request_size,response_size,min_latency,mean_latency,max_latency,p99_latency,transaction_rate"; echo $fields; for s in 10 100 1000 10000 100000 1000000;do netperf -P0 -t TCP_RR -l 3 -H alap5 -- -r $s,100 -o "$fields";done

10gbe:

request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate
10 100 43 64.30 390 96 15526.084
100 100 57 75.12 428 122 13286.602
1000 100 47 74.41 270 108 13412.125
10000 100 89 114.63 712 152 8700.643
100000 100 167 255.90 584 312 3903.516
1000000 100 891 1015.99 2470 1143 983.708

Same hosts, but with my workstation forced to use a 1gbit connection:

request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate
10 100 78 131.18 2425 257 7613.416
100 100 81 129.25 425 255 7727.473
1000 100 100 162.12 1444 266 6161.388
10000 100 310 686.19 1797 927 1456.204
100000 100 1006 1114.20 1472 1199 896.770
1000000 100 8338 8420.96 8827 8498 118.410

I haven't checked, but I'd assume that 100 bytes back and forth should easily
fit a new message to update LSNs and the existing feedback response. Even just
the difference between sending 100 bytes and sending 10k (a bit more than a
single WAL page) is pretty significant on a 1gbit network.

Of course, the relatively low latency between these systems makes this more
pronounced than if this were a cross-regional or even cross-continental link,
where the roundtrip latency is more likely to be dominated by distance rather
than throughput.

Testing between europe and western US:
request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate
10 100 157934 167627.12 317705 160000 5.652
100 100 161294 171323.59 324017 170000 5.530
1000 100 161392 171521.82 324629 170000 5.524
10000 100 163651 173651.06 328488 170000 5.456
100000 100 166344 198070.20 638205 170000 4.781
1000000 100 225555 361166.12 1302368 240000 2.568

No meaningful difference before getting to 100k. But it's pretty easy to lag
by 100k on a longer distance link...

Greetings,

Andres Freund
