Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION
Date: 2023-11-22 10:38:49
Message-ID: CAA4eK1Kj6THmgXZ98d8rxAJa+V1KGS79KgqVve5i0Oyn+=tbMw@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Nov 21, 2023 at 6:56 PM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
>
> On 11/21/23 14:16, Amit Kapila wrote:
> > On Tue, Nov 21, 2023 at 5:17 PM Tomas Vondra
> > <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> >>
> >
> > It seems there is some inconsistency in what you have written for
> > client backends/tablesync worker vs. apply worker. The above text
> > seems to be saying that the client backend and table sync worker are
> > waiting on a "subscription row in pg_subscription" and the apply
> > worker is operating on "pg_subscription_rel". So, if that is true then
> > they shouldn't get stuck.
> >
> > I think here client backend and tablesync worker seems to be blocked
> > for a lock on pg_subscription_rel.
> >
>
> Not really, they are all locking the subscription. All the locks are on
> classid=6100, which is pg_subscription:
>
> test=# select 6100::regclass;
> regclass
> -----------------
> pg_subscription
> (1 row)
>
> The thing is, the tablesync workers call UpdateSubscriptionRelState,
> which locks the pg_subscription catalog at the very beginning:
>
> LockSharedObject(SubscriptionRelationId, ...);
>
> So that's the issue. I haven't explored why it's done this way, and
> there's no comment explaining locking the subscriptions is needed ...
>

I think it prevents concurrent drop of rel during the REFRESH operation.

> >> The tablesync workers can't proceed because their lock request is stuck
> >> behind the AccessExclusiveLock request.
> >>
> >> And the apply worker can't proceed, because it's waiting for status
> >> update from the tablesync workers.
> >>
> >
> > This part is not clear to me because
> > wait_for_relation_state_change()->GetSubscriptionRelState() seems to
> > be releasing the lock while closing the relation. Am, I missing
> > something?
> >
>
> I think you're missing the fact that GetSubscriptionRelState() acquires
> and releases the lock on pg_subscription_rel, but that's not the lock
> causing the issue. The problem is the lock on the pg_subscription row.
>

Okay. IIUC, what's going on here is that the apply worker acquires
AccessShareLock on pg_subscription to update rel state for one of the
tables say tbl-1, and then for another table say tbl-2, it started
waiting for a state change via wait_for_relation_state_change(). I
think here the fix is to commit the transaction before we go for a
wait. I guess we need something along the lines of what is proposed in
[1] though we have solved the problem in that thread in some other
way..

[1] - https://www.postgresql.org/message-id/1412708.1674417574%40sss.pgh.pa.us

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tomas Vondra 2023-11-22 11:21:25 Re: undetected deadlock in ALTER SUBSCRIPTION ... REFRESH PUBLICATION
Previous Message Давыдов Виталий 2023-11-22 10:29:30 Re: How to accurately determine when a relation should use local buffers?