Re: Deadlock between logrep apply worker and tablesync worker

From: vignesh C <vignesh21(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Deadlock between logrep apply worker and tablesync worker
Date: 2023-01-30 04:16:45
Message-ID: CALDaNm2uMjh+oeDZx9M=A6X5Fkp0XY0FvHTTHLzM4pVct1uTMg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Fri, 27 Jan 2023 at 17:46, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Jan 27, 2023 at 3:45 PM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> >
> > On Mon, 23 Jan 2023 at 10:52, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > IIRC, this is done to prevent concurrent drops of origin drop say by
> > > exposed API pg_replication_origin_drop(). See the discussion in [1]
> > > related to it. If we want we can optimize it so that we can acquire
> > > the lock on the specific origin as mentioned in comments
> > > replorigin_drop_by_name() but it was not clear that this operation
> > > would be frequent enough.
> >
> > Here is an attached patch to lock the replication origin record using
> > LockSharedObject instead of locking pg_replication_origin relation in
> > ExclusiveLock mode. Now tablesync worker will wait only if the
> > tablesync worker is trying to drop the same replication origin which
> > has already been dropped by the apply worker, the other tablesync
> > workers will be able to successfully drop the replication origin
> > without any wait.
> >
>
> There is a code in the function replorigin_drop_guts() that uses the
> functionality introduced by replorigin_exists(). Can we reuse this
> function for the same?
>
> Also, it would be good if you can share the numbers for different runs
> of "src/test/subscription/t/002_types.pl" before and after the patch.

By using only Tom Lane's Fix, I noticed that the execution time is
varying between 3.6 seconds to 4.4 seconds.
By using only the "Changing to LockSharedObject" fix, I noticed that
the execution time is varying between 3.8 seconds to 4.6 seconds.
By using the combined Fix(Tom Lane's fix + changing to
LockSharedObject) , I noticed that the execution time is varying
between 3.5 seconds to 3.8 seconds.
I felt both the changes will be required as it will also handle the
scenario when both the apply worker and the table sync worker try to
drop the same replication origin.

The execution results for the same:
With only Tom Lane's fix:
[12:25:32] t/002_types.pl .. ok 3604 ms ( 0.00 usr 0.00 sys +
2.26 cusr 0.37 csys = 2.63 CPU)
[12:25:48] t/002_types.pl .. ok 3788 ms ( 0.00 usr 0.00 sys +
2.24 cusr 0.39 csys = 2.63 CPU)
[12:26:01] t/002_types.pl .. ok 3783 ms ( 0.00 usr 0.00 sys +
2.42 cusr 0.37 csys = 2.79 CPU)
[12:26:14] t/002_types.pl .. ok 3845 ms ( 0.00 usr 0.00 sys +
2.38 cusr 0.44 csys = 2.82 CPU)
[12:26:29] t/002_types.pl .. ok 3923 ms ( 0.00 usr 0.00 sys +
2.54 cusr 0.39 csys = 2.93 CPU)
[12:26:42] t/002_types.pl .. ok 4416 ms ( 0.00 usr 0.00 sys +
2.73 cusr 0.48 csys = 3.21 CPU)
[12:26:55] t/002_types.pl .. ok 4310 ms ( 0.00 usr 0.00 sys +
2.62 cusr 0.39 csys = 3.01 CPU)
[12:27:09] t/002_types.pl .. ok 4168 ms ( 0.00 usr 0.00 sys +
2.67 cusr 0.46 csys = 3.13 CPU)
[12:27:21] t/002_types.pl .. ok 4167 ms ( 0.00 usr 0.00 sys +
2.46 cusr 0.53 csys = 2.99 CPU)
[12:27:34] t/002_types.pl .. ok 4144 ms ( 0.00 usr 0.00 sys +
2.59 cusr 0.41 csys = 3.00 CPU)
[12:27:46] t/002_types.pl .. ok 3982 ms ( 0.00 usr 0.00 sys +
2.52 cusr 0.41 csys = 2.93 CPU)
[12:28:03] t/002_types.pl .. ok 4190 ms ( 0.01 usr 0.00 sys +
2.67 cusr 0.46 csys = 3.14 CPU)

With only "Changing to LockSharedObject" fix:
[12:33:02] t/002_types.pl .. ok 3815 ms ( 0.00 usr 0.00 sys +
2.30 cusr 0.38 csys = 2.68 CPU)
[12:33:16] t/002_types.pl .. ok 4295 ms ( 0.00 usr 0.00 sys +
2.66 cusr 0.42 csys = 3.08 CPU)
[12:33:31] t/002_types.pl .. ok 4270 ms ( 0.00 usr 0.00 sys +
2.72 cusr 0.44 csys = 3.16 CPU)
[12:33:44] t/002_types.pl .. ok 4460 ms ( 0.00 usr 0.00 sys +
2.78 cusr 0.45 csys = 3.23 CPU)
[12:33:58] t/002_types.pl .. ok 4340 ms ( 0.01 usr 0.00 sys +
2.67 cusr 0.45 csys = 3.13 CPU)
[12:34:11] t/002_types.pl .. ok 4142 ms ( 0.00 usr 0.00 sys +
2.58 cusr 0.42 csys = 3.00 CPU)
[12:34:24] t/002_types.pl .. ok 4459 ms ( 0.00 usr 0.00 sys +
2.76 cusr 0.49 csys = 3.25 CPU)
[12:34:38] t/002_types.pl .. ok 4427 ms ( 0.00 usr 0.00 sys +
2.68 cusr 0.48 csys = 3.16 CPU)
[12:35:10] t/002_types.pl .. ok 4642 ms ( 0.00 usr 0.00 sys +
2.84 cusr 0.55 csys = 3.39 CPU)
[12:35:22] t/002_types.pl .. ok 4047 ms ( 0.01 usr 0.00 sys +
2.49 cusr 0.46 csys = 2.96 CPU)
[12:35:32] t/002_types.pl .. ok 4505 ms ( 0.01 usr 0.00 sys +
2.90 cusr 0.45 csys = 3.36 CPU)
[12:36:03] t/002_types.pl .. ok 4088 ms ( 0.00 usr 0.00 sys +
2.51 cusr 0.42 csys = 2.93 CPU)

002_types with combination of Tom Lane's and "Changing to LockSharedObject" fix:
[10:22:04] t/002_types.pl .. ok 3730 ms ( 0.00 usr 0.00 sys +
2.30 cusr 0.41 csys = 2.71 CPU)
[10:23:40] t/002_types.pl .. ok 3666 ms ( 0.00 usr 0.00 sys +
2.16 cusr 0.42 csys = 2.58 CPU)
[10:23:31] t/002_types.pl .. ok 3665 ms ( 0.00 usr 0.00 sys +
2.31 cusr 0.40 csys = 2.71 CPU)
[10:23:23] t/002_types.pl .. ok 3500 ms ( 0.00 usr 0.00 sys +
2.20 cusr 0.36 csys = 2.56 CPU)
[10:23:14] t/002_types.pl .. ok 3704 ms ( 0.00 usr 0.00 sys +
2.36 cusr 0.35 csys = 2.71 CPU)
[10:23:05] t/002_types.pl .. ok 3594 ms ( 0.00 usr 0.00 sys +
2.32 cusr 0.31 csys = 2.63 CPU)
[10:24:10] t/002_types.pl .. ok 3702 ms ( 0.00 usr 0.00 sys +
2.27 cusr 0.42 csys = 2.69 CPU)
[10:24:22] t/002_types.pl .. ok 3741 ms ( 0.00 usr 0.00 sys +
2.39 cusr 0.36 csys = 2.75 CPU)
[10:24:38] t/002_types.pl .. ok 3676 ms ( 0.00 usr 0.00 sys +
2.28 cusr 0.43 csys = 2.71 CPU)
[10:24:50] t/002_types.pl .. ok 3843 ms ( 0.00 usr 0.00 sys +
2.36 cusr 0.43 csys = 2.79 CPU)
[10:25:03] t/002_types.pl .. ok 3710 ms ( 0.00 usr 0.00 sys +
2.30 cusr 0.36 csys = 2.66 CPU)
[10:25:12] t/002_types.pl .. ok 3695 ms ( 0.00 usr 0.00 sys +
2.34 cusr 0.35 csys = 2.69 CPU)

002_types on HEAD:
[10:31:05] t/002_types.pl .. ok 5687 ms ( 0.00 usr 0.00 sys +
2.35 cusr 0.45 csys = 2.80 CPU)
[10:31:31] t/002_types.pl .. ok 6815 ms ( 0.00 usr 0.00 sys +
2.61 cusr 0.43 csys = 3.04 CPU)
[10:31:47] t/002_types.pl .. ok 5561 ms ( 0.00 usr 0.00 sys +
2.24 cusr 0.47 csys = 2.71 CPU)
[10:32:05] t/002_types.pl .. ok 4542 ms ( 0.00 usr 0.00 sys +
2.27 cusr 0.39 csys = 2.66 CPU)
[10:32:20] t/002_types.pl .. ok 3663 ms ( 0.00 usr 0.00 sys +
2.30 cusr 0.38 csys = 2.68 CPU)
[10:32:33] t/002_types.pl .. ok 3627 ms ( 0.00 usr 0.00 sys +
2.27 cusr 0.32 csys = 2.59 CPU)
[10:32:45] t/002_types.pl .. ok 3808 ms ( 0.00 usr 0.00 sys +
2.41 cusr 0.39 csys = 2.80 CPU)
[10:32:59] t/002_types.pl .. ok 4536 ms ( 0.00 usr 0.00 sys +
2.24 cusr 0.38 csys = 2.62 CPU)
[10:33:13] t/002_types.pl .. ok 3638 ms ( 0.00 usr 0.00 sys +
2.25 cusr 0.41 csys = 2.66 CPU)
[10:33:35] t/002_types.pl .. ok 4796 ms ( 0.00 usr 0.00 sys +
2.38 cusr 0.38 csys = 2.76 CPU)
[10:33:51] t/002_types.pl .. ok 4695 ms ( 0.00 usr 0.00 sys +
2.40 cusr 0.37 csys = 2.77 CPU)
[10:34:06] t/002_types.pl .. ok 5738 ms ( 0.00 usr 0.00 sys +
2.44 cusr 0.43 csys = 2.87 CPU)

Regards,
Vignesh

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message John Naylor 2023-01-30 04:31:41 Re: [PoC] Improve dead tuple storage for lazy vacuum
Previous Message Kyotaro Horiguchi 2023-01-30 04:13:24 Re: Time delayed LR (WAS Re: logical replication restrictions)