Re: Improving the latch handling between logical replication launcher and worker processes.

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: vignesh21(at)gmail(dot)com
Cc: hlinnaka(at)iki(dot)fi, kuroda(dot)hayato(at)fujitsu(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Improving the latch handling between logical replication launcher and worker processes.
Date: 2024-09-04 03:02:14
Message-ID: 20240904.120214.2117610340510711755.horikyota.ntt@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

At Tue, 3 Sep 2024 09:10:07 +0530, vignesh C <vignesh21(at)gmail(dot)com> wrote in
> The attached v2 version patch has the changes for the same.

Sorry for jumping in at this point. I've just reviewed the latest
patch (v2), and the frequent Own/Disown-Latch operations caught my
attention. Additionally, handling multiple concurrently operating
trigger sources with nested latch waits seems bug-prone, which I’d
prefer to avoid from both a readability and safety perspective.

With that in mind, I’d like to suggest an alternative approach. I may
not be fully aware of all the previous discussions, so apologies if
this idea has already been considered and dismissed.

Currently, WaitForReplicationWorkerAttach() and
logicalrep_worker_stop_internal() wait on a latch after verifying the
desired state. This ensures that even if there are spurious or missed
wakeups, they won't cause issues. In contrast, ApplyLauncherMain()
enters a latch wait without checking the desired state
first. Consequently, if another process sets the latch to wake up the
main loop while the former two functions are waiting, that wakeup
could be missed. If my understanding is correct, the problem lies in
ApplyLauncherMain() not checking the expected state before beginning
to wait on the latch. There is no issue with waiting if the state
hasn't been satisfied yet.

So, I propose that ApplyLauncherMain() should check the condition that
triggers a main loop wakeup before calling WaitLatch(). To do this, we
could add a flag in LogicalRepCtxStruct to signal that the main loop
has immediate tasks to handle. ApplyLauncherWakeup() would set this
flag before sending SIGUSR1. In turn, ApplyLauncherMain() would check
this flag before calling WaitLatch() and skip the WaitLatch() call if
the flag is set.

I think this approach could solve the issue without adding
complexity. What do you think?

regard.

--
Kyotaro Horiguchi
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2024-09-04 03:47:20 Re: Collect statistics about conflicts in logical replication
Previous Message Nathan Bossart 2024-09-04 01:41:49 Re: optimizing pg_upgrade's once-in-each-database steps