Re: Improving the latch handling between logical replication launcher and worker processes.

From: Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
To: vignesh C <vignesh21(at)gmail(dot)com>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Improving the latch handling between logical replication launcher and worker processes.
Date: 2024-07-05 13:08:07
Message-ID: ff0663d9-8011-420f-a169-efbf57327cb5@iki.fi
Lists: pgsql-hackers

On 05/07/2024 14:07, vignesh C wrote:
> On Thu, 4 Jul 2024 at 16:52, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> wrote:
>>
>> I don't quite understand the problem we're trying to fix:
>>
>>> Currently the launcher's latch is used for the following: a) worker
>>> process attach b) worker process exit and c) subscription creation.
>>> Since this same latch is used for multiple cases, the launcher process
>>> is not able to handle concurrent scenarios like: a) Launcher started a
>>> new apply worker and waiting for apply worker to attach and b) create
>>> subscription sub2 sending launcher wake up signal. In this scenario,
>>> both of them will set latch of the launcher process, the launcher
>>> process is not able to identify that both operations have occurred 1)
>>> worker is attached 2) subscription is created and apply worker should
>>> be started. As a result the apply worker does not get started for the
>>> new subscription created immediately and gets started after the
>>> timeout of 180 seconds.
>>
>> I don't see how we could miss a notification. Yes, both events will set
>> the same latch. Why is that a problem?
>
> The launcher process waits for the apply worker to attach in the
> WaitForReplicationWorkerAttach function. The launcher's latch gets
> set concurrently, both for another create subscription and for the
> apply worker attaching. The launcher detects that the latch is set
> while waiting in WaitForReplicationWorkerAttach, resets it, and
> proceeds to the main loop, where it waits for
> DEFAULT_NAPTIME_PER_CYCLE (as the latch has already been reset).
> Further details are provided below.
>
>> The loop will see that the new
>> worker has attached, and that the new subscription has been created, and
>> process both events. Right?
>
> Since the latch was already reset in WaitForReplicationWorkerAttach,
> the launcher skips processing the create subscription event.
>
> Detailing further:
> Consider the scenario where we execute two concurrent create
> subscription commands: first CREATE SUBSCRIPTION sub1, followed
> immediately by CREATE SUBSCRIPTION sub2.
> Occasionally, the following events may unfold:
> After the first create subscription command (sub1), the backend will
> set the launcher's latch because of ApplyLauncherWakeupAtCommit.
> Subsequently, the launcher process will reset the latch and identify
> the addition of a new subscription in the pg_subscription list. The
> launcher process will proceed to request the postmaster to start the
> apply worker background process (sub1) and request the postmaster to
> notify the launcher once the apply worker (sub1) has been started.
> The launcher will then wait for the apply worker (sub1) to attach in
> the WaitForReplicationWorkerAttach function.
> Meanwhile, the second CREATE SUBSCRIPTION command (sub2), which was
> executed concurrently, also set the launcher's latch (because of
> ApplyLauncherWakeupAtCommit).
> While the launcher remains in the WaitForReplicationWorkerAttach
> function waiting for the apply worker of sub1 to start, the
> postmaster will start the apply worker for subscription sub1 and send
> a SIGUSR1 signal to the launcher process via
> ReportBackgroundWorkerPID. Upon receiving this signal, the launcher
> process will then set its latch.
>
> At this point, the launcher's latch has been set concurrently by two
> events: the start of the apply worker for sub1 and the execution of
> the CREATE SUBSCRIPTION sub2 command.
>
> At this juncture, the launcher, which had been awaiting the attachment
> of the apply worker, detects that the latch is set and proceeds to
> reset it. The reset operation of the latch occurs within the following
> section of code in WaitForReplicationWorkerAttach:
> ...
> rc = WaitLatch(MyLatch,
>                WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
>                10L, WAIT_EVENT_BGWORKER_STARTUP);
>
> if (rc & WL_LATCH_SET)
> {
>     ResetLatch(MyLatch);
>     CHECK_FOR_INTERRUPTS();
> }
> ...
>
> After resetting the latch here, the wakeup intended to trigger
> starting the apply worker for subscription sub2 is lost. The launcher
> will return to the ApplyLauncherMain function, where it will wait for
> DEFAULT_NAPTIME_PER_CYCLE (180 seconds) before processing the create
> subscription request (i.e., creating a new apply worker for sub2).
> The issue arises from the latch being reset in
> WaitForReplicationWorkerAttach, which can occasionally delay the
> synchronization of table data for the subscription.

Ok, I see it now. Thanks for the explanation. So the problem isn't using
the same latch for different purposes per se. It's that we're trying to
use it in a nested fashion, resetting it in the inner loop.
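
To restate the hazard in code form, here is a condensed sketch of the
two call sites (not a verbatim excerpt; names follow launcher.c):

    /* Inner loop, in WaitForReplicationWorkerAttach(): */
    rc = WaitLatch(MyLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                   10L, WAIT_EVENT_BGWORKER_STARTUP);
    if (rc & WL_LATCH_SET)
    {
        /* This also consumes the CREATE SUBSCRIPTION wakeup. */
        ResetLatch(MyLatch);
        CHECK_FOR_INTERRUPTS();
    }

    /* Outer loop, back in ApplyLauncherMain(): */
    rc = WaitLatch(MyLatch,
                   WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                   wait_time, WAIT_EVENT_LOGICAL_LAUNCHER_MAIN);
    /*
     * The latch was already reset above, so this sleeps the full
     * wait_time (DEFAULT_NAPTIME_PER_CYCLE) before noticing sub2.
     */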

Looking at the proposed patch more closely:

> @@ -221,13 +224,13 @@ WaitForReplicationWorkerAttach(LogicalRepWorker *worker,
>            * We need timeout because we generally don't get notified via latch
>            * about the worker attach. But we don't expect to have to wait long.
>            */
> -         rc = WaitLatch(MyLatch,
> +         rc = WaitLatch(&worker->launcherWakeupLatch,
>                          WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
>                          10L, WAIT_EVENT_BGWORKER_STARTUP);
>
>           if (rc & WL_LATCH_SET)
>           {
> -             ResetLatch(MyLatch);
> +             ResetLatch(&worker->launcherWakeupLatch);
>               CHECK_FOR_INTERRUPTS();
>           }
>       }

The comment here is now outdated, right? The patch adds an explicit
SetLatch() call to ParallelApplyWorkerMain(), to notify the launcher
about the attachment.
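
I.e., presumably something like this at attach time (a sketch going by
the patch description; the field name is assumed to match your patch):

    /* In the worker, once attached: wake the waiting launcher. */
    SetLatch(&MyLogicalRepWorker->launcherWakeupLatch);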

Is the launcherWakeupLatch field protected by some lock, to control
which process owns it? OwnLatch() is called in
logicalrep_worker_stop_internal() without holding a lock, for example.
Is there a guarantee that only one process can do that at the same time?
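
(If not, one option - just a sketch, with a hypothetical 'latch_owned'
flag - would be to serialize the transfer under LogicalRepWorkerLock:

    LWLockAcquire(LogicalRepWorkerLock, LW_EXCLUSIVE);
    if (!worker->latch_owned)   /* hypothetical flag, illustration only */
    {
        OwnLatch(&worker->launcherWakeupLatch);
        worker->latch_owned = true;
    }
    LWLockRelease(LogicalRepWorkerLock);
)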

What happens if a process never calls DisownLatch(), e.g. because it
bailed out with an error while waiting for the worker to attach or stop?
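
For what it's worth, one way to guarantee the disown - purely a sketch
on top of your patch, using the existing before_shmem_exit machinery;
'generation' and 'bgw_handle' stand for whatever the caller has at hand:

    static void
    disown_launcher_wakeup_latch(int code, Datum arg)
    {
        LogicalRepWorker *worker = (LogicalRepWorker *) DatumGetPointer(arg);

        DisownLatch(&worker->launcherWakeupLatch);
    }

    OwnLatch(&worker->launcherWakeupLatch);
    PG_ENSURE_ERROR_CLEANUP(disown_launcher_wakeup_latch,
                            PointerGetDatum(worker));
    {
        WaitForReplicationWorkerAttach(worker, generation, bgw_handle);
    }
    PG_END_ENSURE_ERROR_CLEANUP(disown_launcher_wakeup_latch,
                                PointerGetDatum(worker));
    DisownLatch(&worker->launcherWakeupLatch);

But that still leaves the ownership question above unanswered.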

As an alternative, smaller fix, I think we could do the attached. It
forces the launcher's main loop to do another iteration, if it calls
logicalrep_worker_launch(). That extra iteration should pick up any
missed notifications.
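
In code, the idea is roughly this (a sketch; the attached patch may
differ in details):

    /* In ApplyLauncherMain()'s main loop (sketch): */
    bool        launched = false;

    foreach(lc, sublist)
    {
        Subscription *sub = (Subscription *) lfirst(lc);

        if (sub->enabled &&
            logicalrep_worker_find(sub->oid, InvalidOid, false) == NULL)
        {
            logicalrep_worker_launch(/* args elided */);
            launched = true;
        }
    }

    /*
     * If we launched anything, don't nap: a wakeup set concurrently
     * (e.g. by CREATE SUBSCRIPTION) may have been consumed inside
     * WaitForReplicationWorkerAttach(), so rescan immediately.
     */
    if (launched)
        wait_time = 0;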

Your solution with an additional latch seems better because it makes
WaitForReplicationWorkerAttach() react more quickly, without the 10 ms
wait. I'm surprised we have that in the first place; 10 ms seems like a
pretty long time to wait for a parallel apply worker to start. Why was
that ever OK?

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachment: logical-launcher-nested-latch-issue.patch (text/x-patch, 1.4 KB)
