From: Peter Smith <smithpb2250(at)gmail(dot)com>
To: "houzj(dot)fnst(at)fujitsu(dot)com" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, vignesh C <vignesh21(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Deadlock between logrep apply worker and tablesync worker
Date: 2023-02-08 19:08:27
Message-ID: CAHut+PuQSQjL_Wd_-8GncNjzX_qMiCg_hP2OTZ1gY4GtQ3tNqQ@mail.gmail.com
Lists: pgsql-hackers
On Fri, Feb 3, 2023 at 6:58 PM houzj(dot)fnst(at)fujitsu(dot)com
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Thursday, February 2, 2023 7:21 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Thu, Feb 2, 2023 at 12:05 PM houzj(dot)fnst(at)fujitsu(dot)com
> > <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > >
> > > On Tuesday, January 31, 2023 1:07 AM vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > > > On Mon, 30 Jan 2023 at 17:30, vignesh C <vignesh21(at)gmail(dot)com> wrote:
> > > >
> > >
> > > I also tried to measure the time for "src/test/subscription/t/002_types.pl"
> > > before and after the patch (which changes the lock level) and Tom's
> > > patch (which splits the transaction), like what Vignesh has shared on
> > > -hackers.
> > >
> > > I ran each case about 100 times. Tom's patch and the lock level patch
> > > behave similarly on my machines[1].
> > >
> > > HEAD: 3426 ~ 6425 ms
> > > HEAD + Tom: 3404 ~ 3462 ms
> > > HEAD + Vignesh: 3419 ~ 3474 ms
> > > HEAD + Tom + Vignesh: 3408 ~ 3454 ms
> > >
> > > Even apart from the reduction in testing time, reducing the lock level
> > > and locking the specific object also helps reduce lock contention,
> > > which benefits users (of the exposed function), the table sync worker,
> > > and the apply worker. So, I think pushing the patch to change the lock
> > > level makes sense.
> > >
> > > And the patch looks good to me.
> > >
> >
> > Thanks for the tests. I also see a reduction in test time variability with Vignesh's
> > patch. I think we can release the locks in case the origin is concurrently
> > dropped as in the attached patch. I am planning to commit this patch
> > tomorrow unless there are more comments or objections.
> >
> > > While on it, after pushing the patch, I think there is another case
> > > that might also be worth improving: the table sync and apply workers
> > > try to drop the same origin, which can cause some delay. This is a
> > > different case (not the deadlock), so I feel we can improve it in a
> > > separate patch.
> > >
> >
> > Right, I think that case could be addressed by Tom's patch to some extent, but
> > I think we should also try to analyze whether we can completely avoid the need
> > to remove origins from both processes. One idea could be to introduce
> > another relstate, something like PRE_SYNCDONE, and set it in a separate
> > transaction before we set the state to SYNCDONE and remove the slot and
> > origin in the tablesync worker.
> > Now, if the tablesync worker errors out for some reason during the second
> > transaction, it can remove the slot and origin after restart by checking the state.
> > However, it would add another relstate, which may not be the best way to
> > address this problem. Anyway, that can be accomplished as a separate patch.
>
> Here is an attempt to achieve that.
> Basically, the patch removes the code that drops the origin in the apply
> worker, and adds a new state PRE_SYNCDONE that is reached after
> synchronization has finished ahead of the apply worker (sublsn set), but
> before dropping the origin and other final cleanups. The tablesync worker
> will restart and redo the cleanup if it fails after reaching the new state.
> Besides, since changes can already be applied to the table in the
> PRE_SYNCDONE state, I also modified the check in
> should_apply_changes_for_rel(). Some other conditions for the origin drop
> in subscription commands were also adjusted in this patch.
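To visualize the proposal above, here is a minimal sketch (not the actual patch) of how the apply worker's should_apply_changes_for_rel() decision could change once a PRE_SYNCDONE state exists: a relation whose sublsn is already set may have changes applied even before it reaches SYNCDONE. The state letter chosen for PRE_SYNCDONE and the simplified signature are assumptions for illustration only.

```c
#include <stdbool.h>

/*
 * Simplified relstate codes modeled on PostgreSQL's SUBREL_STATE_*
 * constants.  SUBREL_STATE_PRESYNCDONE is the hypothetical new state
 * discussed in this thread; its letter is an assumption, not from any
 * committed patch.
 */
#define SUBREL_STATE_SYNCDONE    's'
#define SUBREL_STATE_READY       'r'
#define SUBREL_STATE_PRESYNCDONE 'p'   /* hypothetical */

typedef unsigned long long XLogRecPtr; /* simplified LSN type */

/*
 * Simplified apply-worker check: with the proposed state, changes may be
 * applied once the relation has reached PRE_SYNCDONE (sublsn set) and the
 * state LSN is not ahead of the remote transaction's final LSN, not only
 * once it has reached SYNCDONE.
 */
static bool
should_apply_changes_for_rel(char state, XLogRecPtr statelsn,
                             XLogRecPtr remote_final_lsn)
{
    return (state == SUBREL_STATE_READY ||
            ((state == SUBREL_STATE_SYNCDONE ||
              state == SUBREL_STATE_PRESYNCDONE) &&
             statelsn <= remote_final_lsn));
}
```

The key design point is that the origin/slot cleanup moves entirely into the tablesync worker: if it crashes between PRE_SYNCDONE and SYNCDONE, the restarted worker can detect the intermediate state and redo only the cleanup, so the apply worker never needs to drop the origin at all.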
>
BTW, tablesync.c has a large file header comment that describes the
relstates in detail, including some examples. So this patch also needs
to update that comment.
------
Kind Regards,
Peter Smith.
Fujitsu Australia.