Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: Lakshmi Narayanan Sreethar <lakshmi(at)timescale(dot)com>, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1
Date: 2023-01-16 09:32:16
Message-ID: CAA4eK1L5c+ZcK72evGxodq3zLje=Qv-t2Qi1GcAmKxnm5SQhYQ@mail.gmail.com
Lists: pgsql-bugs

On Sat, Jan 14, 2023 at 9:32 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
>
> The problem is here:
>
> On 2023-01-13 20:53:49 +0530, Lakshmi Narayanan Sreethar wrote:
> > #7 0x0000559cccbe1e71 in LogicalRepSyncTableStart
> > (origin_startpos=0x7fffb26f7728) at
> > /pg15.1/src/backend/replication/logical/tablesync.c:1353
>
> Because the logical rep code explicitly prevents interrupts:
>
>     /*
>      * Create a new permanent logical decoding slot. This slot will be used
>      * for the catchup phase after COPY is done, so tell it to use the
>      * snapshot to make the final data consistent.
>      *
>      * Prevent cancel/die interrupts while creating slot here because it is
>      * possible that before the server finishes this command, a concurrent
>      * drop subscription happens which would complete without removing this
>      * slot leading to a dangling slot on the server.
>      */
>     HOLD_INTERRUPTS();
>     walrcv_create_slot(LogRepWorkerWalRcvConn,
>                        slotname, false /* permanent */ , false /* two_phase */ ,
>                        CRS_USE_SNAPSHOT, origin_startpos);
>     RESUME_INTERRUPTS();
>
> Which is just completely entirely wrong. Independent of this issue even. Not
> allowing termination for the duration of command executed over network?
>
> This is from:
>
> commit 6b67d72b604cb913e39324b81b61ab194d94cba0
> Author: Amit Kapila <akapila(at)postgresql(dot)org>
> Date: 2021-03-17 08:15:12 +0530
>
> Fix race condition in drop subscription's handling of tablesync slots.
>
> Commit ce0fdbfe97 made tablesync slots permanent and allow Drop
> Subscription to drop such slots. However, it is possible that before
> tablesync worker could get the acknowledgment of slot creation, drop
> subscription stops it and that can lead to a dangling slot on the
> publisher. Prevent cancel/die interrupts while creating a slot in the
> tablesync worker.
>
> Reported-by: Thomas Munro as per buildfarm
> Author: Amit Kapila
> Reviewed-by: Vignesh C, Takamichi Osumi
> Discussion: https://postgr.es/m/CA+hUKGJG9dWpw1cOQ2nzWU8PHjm=PTraB+KgE5648K9nTfwvxg@mail.gmail.com
>
>
> But this can't be the right fix.
>

I will look into this and your suggestion in a later email.
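
To make the effect of the quoted HOLD_INTERRUPTS()/RESUME_INTERRUPTS() pair concrete, here is a minimal standalone sketch of a holdoff counter. It is not the PostgreSQL implementation: hold_interrupts(), resume_interrupts(), check_for_interrupts(), fake_remote_command() and demo_held_command() are hypothetical stand-ins, and the "remote command" is simulated by raising SIGTERM in-process. It only shows that a terminate request arriving while the counter is nonzero is not acted on until the counter drops back to zero.

```c
#include <assert.h>
#include <signal.h>

/* Hypothetical stand-ins for illustration; not the PostgreSQL definitions. */
static volatile sig_atomic_t interrupt_pending = 0;
static volatile sig_atomic_t holdoff_count = 0;

/* The signal handler only records that a terminate request arrived. */
static void on_sigterm(int signo)
{
    (void) signo;
    interrupt_pending = 1;
}

static void hold_interrupts(void)
{
    holdoff_count++;
}

static void resume_interrupts(void)
{
    assert(holdoff_count > 0);
    holdoff_count--;
}

/* Returns 1 if a pending request was serviced, 0 if none or held off. */
static int check_for_interrupts(void)
{
    if (!interrupt_pending || holdoff_count > 0)
        return 0;
    interrupt_pending = 0;
    return 1;
}

/*
 * Simulated remote command: the terminate request arrives mid-call, and
 * the interrupt check made during the call reports whether it was serviced.
 */
static int fake_remote_command(void)
{
    raise(SIGTERM);
    return check_for_interrupts();
}

/* Runs the held section and reports when the request was serviced. */
static void demo_held_command(int *during, int *after)
{
    signal(SIGTERM, on_sigterm);

    hold_interrupts();
    *during = fake_remote_command();   /* request is deferred here */
    resume_interrupts();

    *after = check_for_interrupts();   /* request is serviced only now */
}
```

Calling demo_held_command(&during, &after) leaves during at 0 and after at 1: the request is ignored for the whole duration of the held call. If the remote side never answers, the holding process cannot be cancelled or terminated in the meantime.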

--
With Regards,
Amit Kapila.
