Re: Get stuck when dropping a subscription during synchronizing table

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Peter Eisentraut <peter(dot)eisentraut(at)2ndquadrant(dot)com>, tushar <tushar(dot)ahuja(at)enterprisedb(dot)com>, Robert Haas <robertmhaas(at)gmail(dot)com>, Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Get stuck when dropping a subscription during synchronizing table
Date: 2017-07-03 02:31:18
Message-ID: CAD21AoCXTnTbHLrHPqshBM399ck+oGMg5PD-usK-5s+kQEv39g@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jun 28, 2017 at 2:13 PM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> On Wed, Jun 28, 2017 at 1:47 AM, Petr Jelinek
> <petr(dot)jelinek(at)2ndquadrant(dot)com> wrote:
>>
>> On 27/06/17 10:51, Masahiko Sawada wrote:
>>> On Mon, Jun 26, 2017 at 12:12 PM, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>>>
>>> I've reviewed this patch briefly.
>>
>> Thanks!
>>
>>>
>>> @@ -515,6 +533,31 @@ logicalrep_worker_stop(Oid subid, Oid relid)
>>> }
>>>
>>> /*
>>> + * Request worker to be stopped on commit.
>>> + */
>>> +void
>>> +logicalrep_worker_stop_at_commit(Oid subid, Oid relid)
>>> +{
>>> + LogicalRepWorkerId *wid;
>>> + MemoryContext old;
>>> +
>>> + old = MemoryContextSwitchTo(TopTransactionContext);
>>> +
>>> + wid = (LogicalRepWorkerId *) palloc(sizeof(LogicalRepWorkerId));
>>> +
>>> + /*
>>> + wid = MemoryContextAlloc(TopTransactionContext,
>>> +
>>> sizeof(LogicalRepWorkerId));
>>> + */
>>> + wid->subid = subid;
>>> + wid->relid = relid;
>>> +
>>> + on_commit_stop_workers = lappend(on_commit_stop_workers, wid);
>>> +
>>> + MemoryContextSwitchTo(old);
>>> +}
>>>
>>> logicalrep_worker_stop_at_commit() has a problem that new_list()
>>> called by lappend() allocates the memory from current memory context.
>>> It should switch the memory context and then allocate the memory for
>>> wid and append it to the list.
>>>
>
> Thank you for updating the patch!
>
>>
>> Right, fixed (I see you did that locally as well based on the above
>> excerpt ;) ).
>
> Oops, yeah that's my test code.
>
>>> --------
>>> @@ -754,9 +773,25 @@ ApplyLauncherShmemInit(void)
>>> void
>>> AtEOXact_ApplyLauncher(bool isCommit)
>>> {
>>> - if (isCommit && on_commit_launcher_wakeup)
>>> - ApplyLauncherWakeup();
>>> + ListCell *lc;
>>> +
>>> + if (isCommit)
>>> + {
>>> + foreach (lc, on_commit_stop_workers)
>>> + {
>>> + LogicalRepWorkerId *wid = lfirst(lc);
>>> + logicalrep_worker_stop(wid->subid, wid->relid);
>>> + }
>>> +
>>> + if (on_commit_launcher_wakeup)
>>> + ApplyLauncherWakeup();
>>>
>>> Stopping the logical rep worker in AtEOXact_ApplyLauncher doesn't
>>> support the prepared transaction. Since we allocate the list
>>> on_commit_stop_workers in TopTransactionContext the postgres crashes
>>> if we execute any query after prepared transaction that removes
>>> subscription relation state. Also after fixed this issue, we still
>>> need to something: the list of on_commit_stop_workers is not
>>> associated the prepared transaction. A query next to "preapre
>>> transaction" tries to stop workers at the commit. There was similar
>>> discussion before.
>>>
>>
>> Hmm, good point. I think for now it makes sense to simply don't allow
>> PREPARE for transactions that manipulate workers similar to what we do
>> when there are exported snapshots. Done it that way in attached.
>
> Agreed. Should we note that in docs?
>
>>
>>> --------
>>> +
>>> + ensure_transaction();
>>> +
>>> UpdateSubscriptionRelState(MyLogicalRepWorker->subid,
>>> +
>>> rstate->relid, rstate->state,
>>> +
>>> rstate->lsn);
>>>
>>>
>>> Should we commit the transaction if we started a new transaction
>>> before update the subscription relation state, or it could be deadlock
>>> risk.
>>
>> We only lock the whole subscription (and only conflicting things are
>> DROP and ALTER SUBSCRIPTION), not individual subscription-relation
>> mapping so it doesn't seem to me like there is any higher risk of
>> deadlocks than anything else which works with multiple tables (or
>> compared to previous behavior).
>>
>
> I'm concerned that a lock for whole subscription could be conflicted
> between ALTER SUBSCRIPTION, table sync worker and apply worker:
>
> Please imagine the following case.
>
> 1. The apply worker updates a subscription relation state R1 of
> subscription S1.
> -> It acquires AccessShareLock on S1, and keep holding.
> 2. ALTER SUBSCRIPTION SET PUBLICATION tries to acquire
> AccessExclusiveLock on S1.
> -> But it waits for the apply worker to release the lock.
> 3. The apply worker calls wait_for_relation_state_change(relid,
> SUBREL_STATE_SYNCDONE) and waits for the table sync worker
> for R2 to change its status.
> -> Note that the apply worker is still holding AccessShareLock on S1
> 4. The table sync worker tries to update its status to SYNCDONE
> -> In UpdateSubscriptionRelState(), it tires to acquire AccessShareLock
> on S1 but waits for it. *deadlock*
>
> To summary, because the apply worker keeps holding AccessShareLock on
> S1, the acquiring AccessExclusiveLock by ALTER SUBSCRIPTION waits for
> the apply worker, and then the table sync worker also waits for the
> ALTER SUBSCRIPTION in order to change its status. And the apply worker
> waits for the table sync worker to change its status.
>
> I encountered the similar case once on my environment. But since it
> completely depends on timing I don't have the concrete reproducing
> steps.
>

Also, now the patch conflicts with current HEAD, so need to be updated.

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Langote 2017-07-03 02:32:15 Re: Multi column range partition table
Previous Message Tom Lane 2017-07-03 02:27:54 Re: Race-like failure in recovery/t/009_twophase.pl