From: | Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com> |
---|---|
To: | Fujii Masao <masao(dot)fujii(at)gmail(dot)com> |
Cc: | Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Cannot shutdown subscriber after DROP SUBSCRIPTION |
Date: | 2017-02-06 17:13:12 |
Message-ID: | b95a2aa4-11a0-d211-1e9b-6545261b265a@2ndquadrant.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On 06/02/17 17:33, Fujii Masao wrote:
> On Sun, Feb 5, 2017 at 5:11 AM, Petr Jelinek
> <petr(dot)jelinek(at)2ndquadrant(dot)com> wrote:
>> On 03/02/17 19:38, Fujii Masao wrote:
>>> On Sat, Feb 4, 2017 at 12:49 AM, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote:
>>>> On Fri, Feb 3, 2017 at 10:56 AM, Kyotaro HORIGUCHI
>>>> <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> wrote:
>>>>> At Fri, 3 Feb 2017 01:02:47 +0900, Fujii Masao <masao(dot)fujii(at)gmail(dot)com> wrote in <CAHGQGwHqQVHmQ7wM=eLNnp1_oy-GVSSAcaJXWjE4nc2twSqXRg(at)mail(dot)gmail(dot)com>
>>>>>> On Thu, Feb 2, 2017 at 2:36 PM, Michael Paquier
>>>>>> <michael(dot)paquier(at)gmail(dot)com> wrote:
>>>>>>> On Thu, Feb 2, 2017 at 2:13 PM, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:
>>>>>>>> Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp> writes:
>>>>>>>>> Then, the reason for the TRY-CATCH cluase is that I found that
>>>>>>>>> some functions called from there can throw exceptions.
>>>>>>>>
>>>>>>>> Yes, but all LWLocks should be released by normal error recovery.
>>>>>>>> It should not be necessary for this code to clean that up by hand.
>>>>>>>> If it were necessary, there would be TRY-CATCH around every single
>>>>>>>> LWLockAcquire in the backend, and we'd have an unreadable and
>>>>>>>> unmaintainable system. Please don't add a TRY-CATCH unless it's
>>>>>>>> *necessary* -- and you haven't explained why this one is.
>>>>>>
>>>>>> Yes.
>>>>>
>>>>> Thank you for the suggestion. I minunderstood that.
>>>>>
>>>>>>> Putting hands into the code and at the problem, I can see that
>>>>>>> dropping a subscription on a node makes it unresponsive in case of a
>>>>>>> stop. And that's just because calls to LWLockRelease are missing as in
>>>>>>> the patch attached. A try/catch problem should not be necessary.
>>>>>>
>>>>>> Thanks for the patch!
>>>>>>
>>>>>> With the patch, LogicalRepLauncherLock is released at the end of
>>>>>> DropSubscription(). But ISTM that the lock should be released just after
>>>>>> logicalrep_worker_stop() and there is no need to protect the removal of
>>>>>> replication slot with the lock.
>>>>>
>>>>> That's true. logicalrep_worker_stop returns after confirmig that
>>>>> worker->proc is cleard, so no false relaunch cannot be caused.
>>>>> After all, logicalrep_worker_stop is surrounded by
>>>>> LWLockAcquire/Relase pair. So it can be moved into the funciton
>>>>> and make the lock secrion to be more narrower.
>>>
>>> If we do this, Assert(LWLockHeldByMe(LogicalRepLauncherLock)) should be
>>> removed and the comment for logicalrep_worker_stop() should be updated.
>>>
>>> Your approach may cause the deadlock. The launcher takes LogicalRepWorkerLock
>>> while holding LogicalRepLauncherLock. OTOH, with your approach,
>>> logicalrep_worker_stop() takes LogicalRepLauncherLock while holding
>>> LogicalRepWorkerLock.
>>>
>>> Therefore I pushed the simple patch which adds LWLockRelease() just after
>>> logicalrep_worker_stop().
>>>
>>> Another problem that I found while reading the code is that the launcher can
>>> start up the worker with the subscription that DROP SUBSCRIPTION just removed.
>>> That is, DROP SUBSCRIPTION removes the target entry from pg_subscription,
>>> but the launcher can see it and start new worker until the transaction for
>>> DROP has been committed.
>>>
>>
>> That was the reason why DropSubscription didn't release the lock in the
>> first place. It was supposed to be released at the end of the
>> transaction though.
>
> OK, I understood why you used the lock in that way. But using LWLock
> for that purpose is not valid.
>
Yeah, I just tried to avoid what we are doing now really hard :)
>>> To fix this issue, I think that DROP SUBSCRIPTION should take
>>> AccessExclusiveLock on pg_subscription, instead of RowExclusiveLock,
>>> so that the launcher cannot see the entry to be being removed.
>>>
>>
>> The whole point of having LogicalRepLauncherLock is to avoid having to
>> do this, so if we do this we could probably get rid of it.
>
> Yes, let's remove LogicalRepLauncherLock and lock pg_subscription
> with AccessExclusive mode at the beginning of DROP SUBSCRIPTION.
> Attached patch does this.
>
Okay, looks reasonable to me.
--
Petr Jelinek http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
From | Date | Subject | |
---|---|---|---|
Next Message | Alvaro Herrera | 2017-02-06 17:14:36 | Re: Index corruption with CREATE INDEX CONCURRENTLY |
Previous Message | Petr Jelinek | 2017-02-06 17:07:40 | Re: IF [NOT] EXISTS for replication slots |