Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1

From: Andres Freund <andres(at)anarazel(dot)de>
To: Lakshmi Narayanan Sreethar <lakshmi(at)timescale(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1
Date: 2023-01-14 17:20:22
Message-ID: 20230114172022.3oy77jhzippyupgx@awork3.anarazel.de
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi,

On 2023-01-14 08:02:01 -0800, Andres Freund wrote:
> Because the logical rep code explicitly prevents interrupts:
>
> /*
> * Create a new permanent logical decoding slot. This slot will be used
> * for the catchup phase after COPY is done, so tell it to use the
> * snapshot to make the final data consistent.
> *
> * Prevent cancel/die interrupts while creating slot here because it is
> * possible that before the server finishes this command, a concurrent
> * drop subscription happens which would complete without removing this
> * slot leading to a dangling slot on the server.
> */
> HOLD_INTERRUPTS();
> walrcv_create_slot(LogRepWorkerWalRcvConn,
> slotname, false /* permanent */ , false /* two_phase */ ,
> CRS_USE_SNAPSHOT, origin_startpos);
> RESUME_INTERRUPTS();
>
> Which is just completely entirely wrong. Independent of this issue even. Not
> allowing termination for the duration of command executed over network?
>
> This is from:
>
> commit 6b67d72b604cb913e39324b81b61ab194d94cba0
> Author: Amit Kapila <akapila(at)postgresql(dot)org>
> Date: 2021-03-17 08:15:12 +0530
>
> Fix race condition in drop subscription's handling of tablesync slots.
>
> Commit ce0fdbfe97 made tablesync slots permanent and allow Drop
> Subscription to drop such slots. However, it is possible that before
> tablesync worker could get the acknowledgment of slot creation, drop
> subscription stops it and that can lead to a dangling slot on the
> publisher. Prevent cancel/die interrupts while creating a slot in the
> tablesync worker.
>
> Reported-by: Thomas Munro as per buildfarm
> Author: Amit Kapila
> Reviewed-by: Vignesh C, Takamichi Osumi
> Discussion: https://postgr.es/m/CA+hUKGJG9dWpw1cOQ2nzWU8PHjm=PTraB+KgE5648K9nTfwvxg@mail.gmail.com
>
>
> But this can't be the right fix.

I wonder if we ought to put a
Assert(InterruptHoldoffCount == 0 && CritSectionCount == 0)

in some of the routines in libpqwalreceiver to protect against issues like
this. It'd be easy enough to introduce one accidentally, due to holding an
lwlock.

Greetings,

Andres Freund

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Amit Kapila 2023-01-16 09:32:16 Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1
Previous Message Tom Lane 2023-01-14 16:09:37 Re: BUG #17739: postgres ts_headline function is not returning matches it should during full text search