Re: Syncrep and improving latency due to WAL throttling

From: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Jakub Wartak <jakub(dot)wartak(at)enterprisedb(dot)com>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org>
Subject: Re: Syncrep and improving latency due to WAL throttling
Date: 2023-12-04 01:45:46
Message-ID: ead51688-958e-2f3b-ae72-baff0031a9c3@enterprisedb.com
Lists: pgsql-hackers

Hi,

Since the last patch version I've done a number of experiments with this
throttling idea, so let me share some of the ideas and results, and see
where that gets us.

The patch versions so far tied everything to syncrep - commit latency
with a sync replica was the original motivation, so this makes sense.
But while thinking about this and discussing it with a couple of people,
I've been wondering why limit this to just that particular option. There
are a couple of other places in the WAL write path where we might do a
similar thing (i.e. wait) or be a bit more aggressive (and do a
write/flush), depending on circumstances.

If I simplify this a bit, there are three WAL positions I can think of:

- write LSN (how far we wrote WAL out, not necessarily flushed to disk)
- flush LSN (how far we flushed WAL to local disk)
- syncrep LSN (how far the sync replica confirmed WAL)

So, why couldn't there be a similar "throttling threshold" for these
events too? Imagine we have three GUCs, with values satisfying this:

wal_write_after < wal_flush_after_local < wal_flush_after_remote

and this meaning:

wal_write_after - if a backend generates this amount of WAL, it will
                  write the completed WAL (but only whole pages)

wal_flush_after_local - if a backend generates this amount of WAL, it
                        will not only write the WAL, but also issue a
                        flush (if still needed)

wal_flush_after_remote - if this amount of WAL is generated, it will
                         wait for syncrep to confirm the flushed LSN

The attached PoC patch does this, mostly the same way as the earlier
patches. XLogInsertRecord is where we decide whether throttling may be
needed, and HandleXLogDelayPending then does the actual work (writing
WAL, flushing it, waiting for syncrep).

One new thing HandleXLogDelayPending does is auto-tune the values a bit.
The idea is that with a per-backend threshold it's hard to enforce some
sort of global limit, because it depends on the number of active
backends. If you set 1MB of WAL per backend, the total might be 1MB or
1000MB, depending on whether there's one backend or a thousand. Who
knows. So this tries to reduce the threshold (if the backend generated
only a tiny fraction of the WAL), or increase the threshold (if it
generated most of it). I'm not entirely sure this behaves sanely under
all circumstances, but for a PoC patch it seems OK.
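
Roughly, the adjustment could look something like this - again just a
sketch of the idea, with made-up names and constants (backend_wal and
total_wal are assumed counters of WAL generated since the last
adjustment, and the clamping range is arbitrary):

static int64
adjust_wal_threshold(int64 threshold, int64 backend_wal, int64 total_wal)
{
    double  share = (total_wal > 0) ? (double) backend_wal / total_wal : 1.0;

    if (share < 0.01)
        threshold /= 2;     /* tiny fraction of the WAL -> throttle sooner */
    else if (share > 0.5)
        threshold *= 2;     /* most of the WAL -> allow a larger threshold */

    /* clamp the threshold to some sane range */
    return Max(8 * 1024, Min(threshold, 16 * 1024 * 1024));
}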

The first two GUCs remind me of what walwriter does, and I've been
asking myself if maybe making it more aggressive would have the same
effect. But I don't think so, because a big part of this throttling
patch is ... well, throttling. Making the backends sleep for a bit (or
wait for something), to slow them down. And walwriter doesn't really do
that, I think.

In a recent off-list discussion, someone asked if maybe this might be
useful to prevent emergencies due to the archiver not keeping up and WAL
filling up the disk. A bit like enforcing a more "strict" limit on WAL
than the current max_wal_size GUC. I'm not sure about that, it's
certainly a very different use case from minimizing the impact on OLTP
latency. But it seems like "archived LSN" might be another "event" the
backends would wait for, just like they wait for syncrep to confirm an
LSN. Ideally it'd never happen, of course, and it seems a bit like a
great footgun (an outage on the archiver may kill PROD), but if you're
at risk of ENOSPC on pg_wal, not doing anything may be risky too ...

FWIW I wonder if maybe we should frame this as a QoS feature, where
instead of "minimize impact of bulk loads" we'd try to "guarantee" or
"reserve" some part of the capacity for certain backends/...

Now, let's look at results from some of the experiments. I wanted to see
how effective this approach could be in minimizing the impact of large
bulk loads on small OLTP transactions in different setups. Thanks to the
two new GUCs this is not strictly about syncrep, so I decided to try
three cases:

1) local, i.e. single-node instance

2) syncrep on the same switch, with 0.1ms latency (1Gbit)

3) syncrep with 10ms latency (also 1Gbit)

And for each configuration I ran a pgbench (30 minutes), either on its
own, or concurrently with a bulk COPY of 1GB of data. The load was done
either by a single backend (so one backend loading 1GB of data), or the
file was split into 10 files of 100MB each, loaded by 10 concurrent
backends.

And I did this test with three configurations:

(a) master - unpatched, current behavior

(b) throttle-1: patched with limits set like this:

wal_write_after = '8kB'
wal_flush_after_local = '16kB'
wal_flush_after_remote = '32kB'

(c) throttle-2: patched with throttling limits set to 4x of (b), i.e.

wal_write_after = '32kB'
wal_flush_after_local = '64kB'
wal_flush_after_remote = '128kB'

And I did this for the traditional three scales (small, medium, large),
to hit different bottlenecks. And of course, I measured both throughput
and latencies.

The full results are available here:

[1] https://github.com/tvondra/wal-throttle-results/tree/master

I'm not going to attach the files visualizing the results here, because
it's like 1MB per file, which is not great for e-mail.

https://github.com/tvondra/wal-throttle-results/blob/master/wal-throttling.pdf
----------------------------------------------------------------------

The first file summarizes the throughput results for the three
configurations, different scales etc. On the left is throughput, on the
right is the number of load cycles completed.

I think this behaves mostly as expected - with the bulk loads, the
throughput drops. How much depends on the configuration (for syncrep
it's far more pronounced). The throttling recovers a lot of it, at the
expense of doing fewer loads - and it's quite a significant drop. But
that's expected, and it's kinda what this patch was about - prioritise
the small OLTP transactions by doing fewer loads. This is not a patch
that would magically inflate the capacity of the system to do more
things.

However, I agree this does not really represent a typical production
OLTP system. Those systems don't run at 100% saturation, except for
short periods, and certainly not if they're doing something
latency-sensitive. So a somewhat more realistic test would be pgbench
throttled at 75% capacity, leaving some spare capacity for the bulk
loads.

I actually tried that, and there are results in [1], but the behavior is
pretty similar to what I'm describing here (except that the system
actually manages to do more bulk loads, of course).

https://raw.githubusercontent.com/tvondra/wal-throttle-results/master/syncrep/latencies-1000-full.eps
-----------------------------------------------------------------------

Now let's look at the second file, which shows latency percentiles for
the medium dataset on syncrep. The difference between master (on the
left) and the two throttling builds is pretty obvious. It's not exactly
the same as "no concurrent bulk loads" in the top row, but not far from it.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachment Content-Type Size
v5-0001-v4.patch text/x-patch 21.7 KB
v5-0002-rework.patch text/x-patch 17.8 KB
