From: Alexey Makhmutov <a(dot)makhmutov(at)postgrespro(dot)ru>
To: pgsql-hackers(at)postgresql(dot)org
Subject: High CPU consumption in cascade replication with large number of walsenders and ConditionVariable broadcast issues
Date: 2025-03-29 03:42:43
Message-ID: 076eb7bd-52e6-4a51-ba00-c744d027b15c@postgrespro.ru
Lists: pgsql-hackers
Hello,
Recently we faced an interesting situation in a standby environment with cascade replication and a large number (~100) of configured walsenders. We noticed very high CPU consumption in this environment, with the most time-consuming operation being signal delivery from the startup recovery process to the walsenders via WalSndWakeup invocations from ApplyWalRecord in xlogrecovery.c.
The startup standby process notifies walsenders for downstream systems using a ConditionVariable (CV) broadcast, so only processes waiting on this CV need to be contacted. However, under high load this still appears to be a bottleneck. The current implementation sends a notification after processing each WAL record (i.e. during each invocation of ApplyWalRecord), which implies a high rate of WalSndWakeup invocations. At the same time, this gives each walsender only a very small chunk of data to process, so almost every walsender is back on the CV wait list by the next iteration. As a result, the wait list is essentially always fully packed in this case, which further reduces the performance of WAL record processing on the standby instance.
To reproduce this behavior we can use a simple environment with three servers: a primary instance, an attached physical standby, and a downstream server with a large number of logical replication subscriptions. The attached synthetic test case (test_scenario.zip) reproduces this behavior: the 'test_prepare.sh' script creates the required environment with test data, and the 'test_execute.sh' script runs 'pgbench' with simple updates against the primary instance to trigger replication to the other servers. On my machine all CPUs become completely saturated with just about 6 clients after 30 seconds of the test. Please check the environment properties at the top of these scripts before running them: they need to be updated to specify the location of the installed PG build, the target location for database instance creation, and the ports to use.
Thinking about possible ways to improve this case, some sort of batching may be a workable option. We could postpone sending the notification until recovery has applied a certain number of records. This reduces the rate of CV notifications and also gives receivers more data to process, so they may not need to enter the CV wait state as often. Counting applied records is not difficult; the tricky part is ensuring that we do not postpone notifications for too long under low load.
The first possible approach to limiting this delay is to ask ReadRecord to return control on the first iteration if no more incoming records are currently available. From the standby recovery perspective, we are either reading and applying records from the upstream instance or waiting for a new record to become available. We could introduce an additional field in XLogPageReadPrivate that tells XLogPageRead to perform a non-blocking read on the first iteration of ReadRecord and return no record if no data is currently available. In that case we can immediately notify the walsenders and then switch back to regular blocking reads.
However, with this approach we cannot control the time required to apply a single record and thus cannot effectively bound the resulting delay. We may hit a record that takes a long time to apply (e.g. 'drop database'), which would hold back all records in the batch from being delivered to the walsenders. Another possible option is to use a timer to schedule the notification while we are either waiting for records or applying them. In this case WalSndWakeup is invoked either after a certain number of records has been applied or after a timeout since the last notification has expired, whichever comes first.
While building a proof of concept for this approach we made another finding: the current implementation of ConditionVariable notification does not seem to work correctly if a process is waiting on one CV while also broadcasting on another CV from an interrupt handler. The comments in the code imply that this case is supported, but in reality it can leave the CV in an inconsistent state and cause problems on subsequent invocations of CV broadcast. The invocation of ConditionVariableBroadcast removes the current process from the wakeup list of the CV referenced by the 'cv_sleep_target' variable and then clears that variable. Once ConditionVariableTimedSleep returns from its latch wait, it adds itself back to the wakeup list, but then returns control to the caller without restoring the 'cv_sleep_target' value. A subsequent invocation of ConditionVariableCancelSleep is then unable to locate the correct CV and complete the wait cycle (the current process remains on the list), so the next invocation of ConditionVariableBroadcast for this CV may fail on assertions (we do not expect to find ourselves on the list of processes that need to receive our own signal). I think the correct approach here is to simply return control to the caller after the wait if 'cv_sleep_target' has changed. We do not need to add ourselves back to the wakeup list in that case, as the caller will re-check its exit condition in its loop and then invoke either ConditionVariableCancelSleep or ConditionVariableTimedSleep again. Both methods correctly update the CV state, so we neither miss notifications nor leave the CV in an improper state.
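To make the proposed fix concrete, here is a paraphrased pseudocode sketch (not the actual condition_variable.c source; the proclist identifiers follow current master, but the surrounding code is heavily simplified) of where the check would sit inside the wait loop of ConditionVariableTimedSleep():

```c
/* Inside the wait loop, after the latch wait -- simplified sketch. */
ResetLatch(MyLatch);

/*
 * Proposed check: while we slept, an interrupt handler may have called
 * ConditionVariableBroadcast() on a different CV.  That call removed us
 * from our own CV's wakeup list and cleared cv_sleep_target.  Do not
 * re-add ourselves here; just report a wakeup.  The caller re-checks its
 * exit condition and then calls either ConditionVariableCancelSleep() or
 * this function again, and both paths restore a consistent CV state.
 */
if (cv != cv_sleep_target)
    return false;               /* treated as "awakened", not a timeout */

/*
 * Unchanged logic: if we were signaled on our own CV, put ourselves back
 * on the wakeup list before returning, so that a notification issued
 * while the caller re-checks its condition is not lost.
 */
SpinLockAcquire(&cv->mutex);
if (!proclist_contains(&cv->wakeup, MyProcNumber, cvWaitLink))
{
    done = true;
    proclist_push_tail(&cv->wakeup, MyProcNumber, cvWaitLink);
}
SpinLockRelease(&cv->mutex);
```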
We've prepared two test patches on top of current master to address both issues:
* 0001-CV-correctly-handle-cv_sleep_target-change.patch – adjusts the CV logic to correctly handle a 'cv_sleep_target' change during sleep.
* 0002-Implement-batching-for-cascade-replication.patch – a test patch implementing a possible batching approach in xlogrecovery.c with a timer. It currently uses GUC variables to allow testing different batch sizes and timeout values.
With both patches applied we observed a significant reduction in CPU consumption when running the synthetic test mentioned above. If only the second patch is applied, the CV problem can be reproduced by building the server with asserts enabled (casserts) and running the TAP tests in the 'src/test/recovery' directory.
It would be great to hear any thoughts on these observations and the proposed fixes, as well as possible pitfalls of the proposed changes.
Thanks,
Alexey
Attachment | Content-Type | Size |
---|---|---|
0001-CV-correctly-handle-cv_sleep_target-change.patch | text/x-patch | 2.3 KB |
0002-Implement-batching-for-cascade-replication.patch | text/x-patch | 7.8 KB |
test_scenario.zip | application/zip | 2.2 KB |