From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Lakshmi Narayanan Sreethar <lakshmi(at)timescale(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
Cc: | pgsql-bugs(at)lists(dot)postgresql(dot)org |
Subject: | Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1 |
Date: | 2023-01-14 17:20:22 |
Message-ID: | 20230114172022.3oy77jhzippyupgx@awork3.anarazel.de |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
Hi,
On 2023-01-14 08:02:01 -0800, Andres Freund wrote:
> Because the logical rep code explicitly prevents interrupts:
>
> /*
> * Create a new permanent logical decoding slot. This slot will be used
> * for the catchup phase after COPY is done, so tell it to use the
> * snapshot to make the final data consistent.
> *
> * Prevent cancel/die interrupts while creating slot here because it is
> * possible that before the server finishes this command, a concurrent
> * drop subscription happens which would complete without removing this
> * slot leading to a dangling slot on the server.
> */
> HOLD_INTERRUPTS();
> walrcv_create_slot(LogRepWorkerWalRcvConn,
> slotname, false /* permanent */ , false /* two_phase */ ,
> CRS_USE_SNAPSHOT, origin_startpos);
> RESUME_INTERRUPTS();
>
> Which is just completely entirely wrong. Independent of this issue even. Not
> allowing termination for the duration of command executed over network?
>
> This is from:
>
> commit 6b67d72b604cb913e39324b81b61ab194d94cba0
> Author: Amit Kapila <akapila(at)postgresql(dot)org>
> Date: 2021-03-17 08:15:12 +0530
>
> Fix race condition in drop subscription's handling of tablesync slots.
>
> Commit ce0fdbfe97 made tablesync slots permanent and allow Drop
> Subscription to drop such slots. However, it is possible that before
> tablesync worker could get the acknowledgment of slot creation, drop
> subscription stops it and that can lead to a dangling slot on the
> publisher. Prevent cancel/die interrupts while creating a slot in the
> tablesync worker.
>
> Reported-by: Thomas Munro as per buildfarm
> Author: Amit Kapila
> Reviewed-by: Vignesh C, Takamichi Osumi
> Discussion: https://postgr.es/m/CA+hUKGJG9dWpw1cOQ2nzWU8PHjm=PTraB+KgE5648K9nTfwvxg@mail.gmail.com
>
>
> But this can't be the right fix.
I wonder if we ought to put a
Assert(InterruptHoldoffCount == 0 && CritSectionCount == 0)
in some of the routines in libpqwalreceiver to protect against issues like
this. It'd be easy enough to introduce one accidentally, due to holding an
lwlock.
Greetings,
Andres Freund
From | Date | Subject | |
---|---|---|---|
Next Message | Amit Kapila | 2023-01-16 09:32:16 | Re: DROP DATABASE deadlocks with logical replication worker in PG 15.1 |
Previous Message | Tom Lane | 2023-01-14 16:09:37 | Re: BUG #17739: postgres ts_headline function is not returning matches it should during full text search |