| From: | vignesh C <vignesh21(at)gmail(dot)com> |
|---|---|
| To: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
| Cc: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
| Subject: | Re: Race condition in FetchTableStates() breaks synchronization of subscription tables |
| Date: | 2024-02-08 09:25:23 |
| Message-ID: | CALDaNm0X8oUiW1CzniPZsDxqjP-VoYuEvb1h7NFXohKc1P5HEw@mail.gmail.com |
| Views: | Whole Thread | Raw Message | Download mbox | Resend email |
| Thread: | |
| Lists: | pgsql-hackers |
On Tue, 6 Feb 2024 at 18:30, Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
>
> 05.02.2024 13:13, vignesh C wrote:
> > Thanks for the steps for the issue, I was able to reproduce this issue
> > in my environment with the steps provided. The attached patch has a
> > proposed fix where the latch will not be set in case of the apply
> > worker exiting immediately after starting.
>
> It looks like the proposed fix doesn't help when ApplyLauncherWakeup()
> called by a backend executing CREATE SUBSCRIPTION command.
> That is, with the v4-0002 patch applied and pg_usleep(300000L); added
> just below
> if (!worker_in_use)
> return worker_in_use;
> I still observe the test 027_nosuperuser running for 3+ minutes:
> t/027_nosuperuser.pl .. ok
> All tests successful.
> Files=1, Tests=19, 187 wallclock secs ( 0.01 usr 0.00 sys + 4.82 cusr 4.47 csys = 9.30 CPU)
>
> IIUC, it's because a launcher wakeup call, sent by "CREATE SUBSCRIPTION
> regression_sub ...", gets missed when launcher waits for start of another
> worker (logical replication worker for subscription "admin_sub"), launched
> just before that command.
Yes, the wakeup call sent by the "CREATE SUBSCRIPTION" command was
getting missed in this case. The wakeup call can be sent during
subscription creation/modification and when the apply worker exits.
WaitForReplicationWorkerAttach should not reset the latch here as it
will end up delaying the apply worker to get started after 180 seconds
timeout(DEFAULT_NAPTIME_PER_CYCLE). The attached patch does not reset
the latch and lets ApplyLauncherMain to reset the latch and checks if
any new worker or missing worker needs to be started.
Regards,
Vignesh
| Attachment | Content-Type | Size |
|---|---|---|
| v5-0002-Apply-worker-will-get-started-after-180-seconds-b.patch | text/x-patch | 2.2 KB |
| v5-0001-Table-sync-missed-for-newly-added-tables-because-.patch | text/x-patch | 3.3 KB |
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Hayato Kuroda (Fujitsu) | 2024-02-08 09:33:08 | RE: Improve eviction algorithm in ReorderBuffer |
| Previous Message | jian he | 2024-02-08 08:52:00 | Re: recently added jsonpath method change jsonb_path_query, jsonb_path_query_first immutability |