Re: Syncrep and improving latency due to WAL throttling

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Syncrep and improving latency due to WAL throttling
Date: 2023-11-08 12:59:55
Message-ID: 750b20d7-6a8c-4c3f-a4b3-23ad409b0046@enterprisedb.com
Lists: pgsql-hackers

On 11/8/23 07:40, Andres Freund wrote:
> Hi,
>
> On 2023-11-04 20:00:46 +0100, Tomas Vondra wrote:
>> scope
>> -----
>> Now, let's talk about scope - what the patch does not aim to do. The
>> patch is explicitly intended for syncrep clusters, not async. There have
>> been proposals to also support throttling for async replicas, logical
>> replication etc. I suppose all of that could be implemented, and I do
>> see the benefit of defining some sort of maximum lag even for async
>> replicas. But the agreement was to focus on the syncrep case, where it's
>> particularly painful, and perhaps extend it in the future.
>
> Perhaps we should take care to make the configuration extensible in that
> direction in the future?
>

Yes, if we can come up with a suitable configuration, that could work
for the other use cases too. I don't have a very good idea what to do
about replicas that may not be connected, or have connected but need to
catch up. IMHO it would be silly to turn this into "almost a sync rep".
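
For the sake of argument, a purely hypothetical sketch of what a more
extensible configuration might look like - separating the "reference
point" of the lag from the size limit, instead of baking syncrep into
the GUC itself (none of these names exist in the patch;
config_enum_entry is the usual GUC machinery):

    typedef enum WalThrottleReference
    {
        WAL_THROTTLE_REF_SYNC_STANDBY,  /* LSN confirmed by sync standby */
        WAL_THROTTLE_REF_ASYNC_STANDBY, /* slowest async standby (future) */
        WAL_THROTTLE_REF_LOCAL_FLUSH    /* local flush LSN (future) */
    } WalThrottleReference;

    static const struct config_enum_entry wal_throttle_reference_options[] = {
        {"sync_standby", WAL_THROTTLE_REF_SYNC_STANDBY, false},
        {"async_standby", WAL_THROTTLE_REF_ASYNC_STANDBY, false},
        {"local_flush", WAL_THROTTLE_REF_LOCAL_FLUSH, false},
        {NULL, 0, false}
    };

    /* plus an integer GUC for the maximum lag in kB, 0 = disabled */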

>
> Hm - is this feature really tied to replication, at all? Pretty much the same
> situation exists without. On an ok-ish local nvme I ran pgbench with 1 client
> and -P1. Guess where I started a VACUUM (on a fully cached table, so no
> continuous WAL flushes):
>
> progress: 64.0 s, 634.0 tps, lat 1.578 ms stddev 0.477, 0 failed
> progress: 65.0 s, 634.0 tps, lat 1.577 ms stddev 0.546, 0 failed
> progress: 66.0 s, 639.0 tps, lat 1.566 ms stddev 0.656, 0 failed
> progress: 67.0 s, 642.0 tps, lat 1.557 ms stddev 0.273, 0 failed
> progress: 68.0 s, 556.0 tps, lat 1.793 ms stddev 0.690, 0 failed
> progress: 69.0 s, 281.0 tps, lat 3.568 ms stddev 1.050, 0 failed
> progress: 70.0 s, 282.0 tps, lat 3.539 ms stddev 1.072, 0 failed
> progress: 71.0 s, 273.0 tps, lat 3.663 ms stddev 2.602, 0 failed
> progress: 72.0 s, 261.0 tps, lat 3.832 ms stddev 1.889, 0 failed
> progress: 73.0 s, 268.0 tps, lat 3.738 ms stddev 0.934, 0 failed
>
> At 32 clients we go from ~10k to 2.5k, with a full 2s of 0.
>
> Subtracting pg_current_wal_flush_lsn() from pg_current_wal_insert_lsn() the
> "good times" show a delay of ~8kB (note that this includes WAL records that
> are still being inserted). Once the VACUUM runs, it's ~2-3MB.
>
> The picture with more clients is similar.
>
> If I instead severely limit the amount of outstanding (but not the amount of
> unflushed) WAL by setting wal_buffers to 128, latency dips quite a bit less
> (down to ~400 instead of ~260 at 1 client, ~10k to ~5k at 32). Of course
> that's ridiculous and will completely trash performance in many other cases,
> but it shows that limiting the amount of outstanding WAL could help without
> replication as well. With remote storage, that'd likely be a bigger
> difference.
>

Yeah, that's an interesting idea. I think the idea of enforcing a
"maximum lag" is fairly general - the difference is which LSN the lag
is measured against. For the syncrep case it's the LSN confirmed by the
replica; what you describe would measure it against either the flush
LSN or the write LSN (the latter being the "outstanding" case, I think).

I guess the remote storage is somewhat similar to the syncrep case, in
that the lag includes some network communication.
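
To spell out the variants (the function names are from xlog.h, though
the exact signatures differ between versions - GetFlushRecPtr() takes a
TimeLineID pointer on recent branches):

    XLogRecPtr  insert_lsn = GetXLogInsertRecPtr();

    /* "outstanding" WAL: inserted, but not yet written out */
    uint64      unwritten = insert_lsn - GetXLogWriteRecPtr();

    /* unflushed WAL: written, but not yet durable locally */
    uint64      unflushed = insert_lsn - GetFlushRecPtr(NULL);

    /*
     * syncrep lag: insert LSN minus the LSN confirmed by the synchronous
     * standby (tracked in walsender shared memory) - this is what the
     * patch throttles on.
     */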

>
>> problems
>> --------
>> Now let's talk about some problems - both conceptual and technical
>> (essentially review comments for the patch).
>>
>> 1) The goal of the patch is to limit the impact on latency, but the
>> relationship between WAL amounts and latency may not be linear. But we
>> don't have a good way to predict latency, and WAL lag is the only thing
>> we have, so there's that. Ultimately, it's a best effort.
>
> It's indeed probably not linear. Realistically, to do better, we probably need
> statistics for the specific system in question - the latency impact will
> differ hugely between different storage/network.
>

True. I can imagine two ways to measure that.

We could have a standalone tool similar to pg_test_fsync that would
mimic how we write/flush WAL, and measure the latency for different
amounts of flushed data. The DBA would then be responsible for somehow
using this to configure the database (perhaps the tool could calculate
some "optimal" value to flush).

Alternatively, we could collect timing in XLogFlush, so that we'd track
the amount of data to flush + timing, and then use that to calculate
expected latency (e.g. by binning by data size and using average latency
for each bin). And then use that, somehow.

So you could say - maximum commit latency is 10ms, and the system would
be able to estimate that the maximum amount of unflushed WAL is 256kB,
and it'd enforce this distance.

Still only best effort, no guarantees, of course.
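
A rough sketch of that second approach (all names hypothetical, the bin
sizes arbitrary):

    /*
     * Hypothetical accounting in XLogFlush(): bin the flush calls by the
     * amount of WAL flushed, track the average latency per bin, and pick
     * the largest backlog whose average latency still fits the target.
     */
    #define FLUSH_NBINS 16              /* bin i ~ flushes below 8kB << i */

    static uint64 flush_total_us[FLUSH_NBINS];
    static uint64 flush_count[FLUSH_NBINS];

    static void
    flush_stats_record(uint64 bytes, uint64 elapsed_us)
    {
        int         bin = 0;

        while (bin < FLUSH_NBINS - 1 && bytes >= ((uint64) 8192 << bin))
            bin++;
        flush_total_us[bin] += elapsed_us;
        flush_count[bin]++;
    }

    /* e.g. target_us = 10000 for the 10ms example, returning ~256kB */
    static uint64
    flush_stats_max_lag(uint64 target_us)
    {
        uint64      best = 8192;

        for (int bin = 0; bin < FLUSH_NBINS; bin++)
            if (flush_count[bin] > 0 &&
                flush_total_us[bin] / flush_count[bin] <= target_us)
                best = (uint64) 8192 << bin;
        return best;
    }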

>
>> 2) The throttling is per backend. That makes it simple, but it means
>> that it's hard to enforce a global lag limit. Imagine the limit is 8MB,
>> and with a single backend that works fine - the lag should not exceed
>> the 8MB value. But if there are N backends, the lag could be up to
>> N-times 8MB, I believe. That's a bit annoying, but I guess the only
>> solution would be to have some autovacuum-like cost balancing, with all
>> backends (or at least those running large stuff) doing the checks more
>> often. I'm not sure we want to do that.
>
> Hm. The average case is likely fine - the throttling of the different backends
> will intersperse and flush more frequently - but the worst case is presumably
> part of the issue here. I wonder if we could deal with this by somehow
> offsetting the points at which backends flush.
>

If I understand correctly, you want the backends to start measuring the
WAL from different LSNs, in order to "distribute" their flush points
uniformly within the WAL (rather than "coordinate" them, which leads to
a higher peak lag).

I guess we could do that, say by flushing only up to

hash(pid) % maximum_allowed_lag
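
i.e. something like this sketch (hash_uint32() is real, in
common/hashfn.h; max_lag_bytes and the rest are made up):

    /*
     * Each backend flushes on its own "grid" of LSNs, shifted by a
     * per-PID offset, so concurrent backends spread their flush points
     * across the WAL instead of piling up on the same boundaries.
     */
    uint64      off = hash_uint32((uint32) MyProcPid) % max_lag_bytes;
    XLogRecPtr  insert_lsn = GetXLogInsertRecPtr();
    XLogRecPtr  target = ((insert_lsn - off) / max_lag_bytes)
                         * max_lag_bytes + off;

    if (target > GetFlushRecPtr(NULL))
        XLogFlush(target);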

I'm not sure that'll really work, especially if the backends are
somewhat naturally "correlated".

Maybe it could work if the backends explicitly coordinated the flushes
to distribute them, but then how would that be different from what the
autovacuum-like costing does, in principle?

However, perhaps there's an "adaptive" way to do this - each backend
knows how much WAL it produced since the last flush LSN, and it can
easily measure the actual lag (unflushed WAL). It could compare the two,
estimate what fraction of the lag it's likely responsible for, and then
adjust the "flush distance" based on that.

Imagine we aim for 1MB unflushed WAL, and you have two backends that
happen to execute at the same time. They both generate 1MB of WAL and
hit the throttling code at about the same time. They discover the actual
lag is not the desired 1MB but 2MB, so (requested_lag/actual_lag) = 0.5,
and they'd adjust the flush distance to 1/2MB. And from that point the
lag stays at 1MB even with two backends.

Then one of the backends terminates, and the other backend eventually
hits the 1/2MB limit again, but the desired lag is 1MB, and it doubles
the distance again.
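
In code, the adjustment could be as simple as this sketch
(backendWalInserted is from the patch; requested_lag and the rest are
made up):

    static uint64 flush_distance = 0;   /* WAL bytes before we throttle */

    static void
    maybe_throttle(void)
    {
        XLogRecPtr  insert_lsn = GetXLogInsertRecPtr();
        uint64      actual_lag = insert_lsn - GetFlushRecPtr(NULL);

        if (flush_distance == 0)
            flush_distance = requested_lag; /* or lower, for a ramp-up */

        if (backendWalInserted < flush_distance)
            return;

        /*
         * Two 1MB backends see actual_lag ~ 2MB, so the distance halves
         * to 0.5MB; once a backend is alone again, the ratio recovers
         * and the distance grows back (capped at the requested lag).
         */
        flush_distance = Min(flush_distance * requested_lag /
                             Max(actual_lag, 1), requested_lag);

        XLogFlush(insert_lsn);
        backendWalInserted = 0;
    }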

Of course, in practice the behavior would be more complicated, thanks to
backends that generate WAL but don't really hit the threshold.

There'd probably also be some sort of ramp-up, i.e. the backend would
not start with the "full" 1MB limit, but perhaps something lower. The
starting value would need to be high enough not to throttle plain OLTP
transactions, though.

> I doubt we want to go for something autovacuum balancing like - that doesn't
> seem to work well - but I think we could take the amount of actually unflushed
> WAL into account when deciding whether to throttle. We have the necessary
> state in local memory IIRC. We'd have to be careful to not throttle every
> backend at the same time, or we'll introduce latency penalties that way. But
> what if we scaled synchronous_commit_wal_throttle_threshold depending on the
> amount of unflushed WAL? By still taking backendWalInserted into account, we'd
> avoid throttling everyone at the same time, but still would make throttling
> more aggressive depending on the amount of unflushed/unreplicated WAL.
>

Oh! Perhaps similar to the adaptive behavior I explained above?

>
>> 3) The actual throttling (flush and wait for syncrep) happens in
>> ProcessInterrupts(), which mostly works but it has two drawbacks:
>>
>> * It may not happen "early enough" if the backends inserts a lot of
>> XLOG records without processing interrupts in between.
>
> Does such code exist? And if so, is there a reason not to fix said code?
>

Not sure. I thought maybe index builds might do something like that, but
it doesn't seem to be the case (at least for the built-in indexes). But
if adding CHECK_FOR_INTERRUPTS to more places is an acceptable fix, I'm
OK with that.
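
Just to illustrate the shape of such a fix (loop body hypothetical):

    /*
     * Any long-running WAL-producing loop just needs a periodic
     * CHECK_FOR_INTERRUPTS(), which is where ProcessInterrupts() sees
     * XLogDelayPending and performs the flush-and-wait.
     */
    while (get_next_chunk(state))       /* hypothetical bulk operation */
    {
        do_wal_logged_work(state);      /* inserts XLOG records */
        CHECK_FOR_INTERRUPTS();
    }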

>
>> * It may happen "too early" if the backend inserts enough WAL to need
>> throttling (i.e. sets XLogDelayPending), but then after processing
>> interrupts it would be busy with other stuff, not inserting more WAL.
>
>> I think ideally we'd do the throttling right before inserting the next
>> XLOG record, but there's no convenient place, I think. We'd need to
>> annotate a lot of places, etc. So maybe ProcessInterrupts() is a
>> reasonable approximation.
>
> Yea, I think there's no way to do that with reasonable effort. Starting to
> wait with a bunch of lwlocks held would obviously be bad.
>

OK.

>
>> We may need to add CHECK_FOR_INTERRUPTS() to a couple more places, but
>> that seems reasonable.
>
> And independently beneficial.
>

OK.

>
>> missing pieces
>> --------------
>> The thing that's missing is that some processes (like aggressive
>> anti-wraparound autovacuum) should not be throttled. If people set the
>> GUC in the postgresql.conf, I guess that'll affect those processes too,
>> so I guess we should explicitly reset the GUC for those processes. I
>> wonder if there are other cases that should not be throttled.
>
> Hm, that's a bit hairy. If we just exempt it we'll actually slow down everyone
> else even further, even though the goal of the feature might be the opposite.
> I don't think that's warranted for anti-wraparound vacuums - they're normal. I
> think failsafe vacuums are a different story - there we really just don't care
> about impacting other backends, the goal is to prevent the cluster from
> going read-only pretty soon.
>

Right, I confused those two autovacuum modes.

>
>> tangents
>> --------
>> While discussing this with Andres a while ago, he mentioned a somewhat
>> orthogonal idea - sending unflushed data to the replica.
>>
>> We currently never send unflushed data to the replica, which makes sense
>> because this data is not durable and if the primary crashes/restarts,
>> this data will disappear. But it also means there may be a fairly large
>> chunk of WAL data that we may need to send at COMMIT and wait for the
>> confirmation.
>>
>> He suggested we might actually send the data to the replica, but the
>> replica would know this data is not flushed yet and so would not do the
>> recovery etc. And at commit we could just send a request to flush,
>> without having to transfer the data at that moment.
>>
>> I don't have a very good intuition about how large the effect would be,
>> i.e. how much unflushed WAL data could accumulate on the primary
>> (kilobytes/megabytes?),
>
> Obviously heavily depends on the workloads. If you have anything with bulk
> writes it can be many megabytes.
>
>
>> and how big is the difference between sending a couple kilobytes or just a
>> request to flush.
>
> Obviously heavily depends on the network...
>

I know it depends on workload/network. I'm merely saying I don't have a
very good intuition about what the values would be for a particular
workload / network.

>
> I used netperf's tcp_rr between my workstation and my laptop on a local 10Gbit
> network (albeit with a crappy external card for my laptop), to put some
> numbers to this. I used -r $s,100 to test sending variable-sized data to the
> other side, with the other side always responding with 100 bytes (assuming
> that'd more than fit a feedback response).
>
> Command:
> fields="request_size,response_size,min_latency,mean_latency,max_latency,p99_latency,transaction_rate"; echo $fields; for s in 10 100 1000 10000 100000 1000000;do netperf -P0 -t TCP_RR -l 3 -H alap5 -- -r $s,100 -o "$fields";done
>
> 10gbe:
>
> request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate
> 10 100 43 64.30 390 96 15526.084
> 100 100 57 75.12 428 122 13286.602
> 1000 100 47 74.41 270 108 13412.125
> 10000 100 89 114.63 712 152 8700.643
> 100000 100 167 255.90 584 312 3903.516
> 1000000 100 891 1015.99 2470 1143 983.708
>
>
> Same hosts, but with my workstation forced to use a 1gbit connection:
>
> request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate
> 10 100 78 131.18 2425 257 7613.416
> 100 100 81 129.25 425 255 7727.473
> 1000 100 100 162.12 1444 266 6161.388
> 10000 100 310 686.19 1797 927 1456.204
> 100000 100 1006 1114.20 1472 1199 896.770
> 1000000 100 8338 8420.96 8827 8498 118.410
>
> I haven't checked, but I'd assume that 100 bytes back and forth should easily
> fit a new message to update LSNs and the existing feedback response. Even just
> the difference between sending 100 bytes and sending 10k (a bit more than a
> single WAL page) is pretty significant on a 1gbit network.
>

I'm on decaf so I may be a bit slow, but it's not very clear to me what
conclusion to draw from these numbers. What is the takeaway?

My understanding is that in both cases the latency is initially fairly
stable, independent of the request size - this holds for requests up to
~1000B. Then the latency starts increasing fairly quickly, well before
bandwidth could be the limit (except maybe for the 1MB requests): at
1Gbit, transmitting 10kB takes only ~80us, yet the mean latency jumps
to ~686us.

I don't think it says we should be replicating WAL in tiny chunks,
because if you need to send a chunk of data it's always more efficient
to send it at once (compared to sending multiple smaller pieces). But if
we manage to send most of this "in the background", only leaving the
last small bit to be sent at the very end, that'd help.
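
A hypothetical wire exchange for that "send early, flush at commit"
idea (message shapes made up - the real protocol uses XLogData /
standby-status-update messages, but the flush request and the
not-yet-durable marker would be new):

    /*
     * primary -> standby, continuously: WAL data including unflushed
     * records, with an explicit "durable up to" LSN so the standby
     * knows not to apply/rely on anything past it.
     */
    send_wal(start_lsn, end_lsn, durable_upto_lsn, wal_data);

    /* primary -> standby, at COMMIT: tiny message, no WAL payload */
    send_flush_request(commit_lsn);

    /* standby -> primary, once its flush LSN passes commit_lsn */
    send_feedback(write_lsn, flush_lsn, apply_lsn);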

> Of course, the relatively low latency between these systems makes this more
> pronounced than if this were a cross regional or even cross continental link,
> where the roundtrip latency is more likely to be dominated by distance rather
> than throughput.
>
> Testing between europe and western US:
> request_size response_size min_latency mean_latency max_latency p99_latency transaction_rate
> 10 100 157934 167627.12 317705 160000 5.652
> 100 100 161294 171323.59 324017 170000 5.530
> 1000 100 161392 171521.82 324629 170000 5.524
> 10000 100 163651 173651.06 328488 170000 5.456
> 100000 100 166344 198070.20 638205 170000 4.781
> 1000000 100 225555 361166.12 1302368 240000 2.568
>
>
> No meaningful difference before getting to 100k. But it's pretty easy to lag
> by 100k on a longer distance link...
>

Right.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
