Re: Assertion failure in WaitForWALToBecomeAvailable state machine

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Dilip Kumar <dilipbalaut(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Assertion failure in WaitForWALToBecomeAvailable state machine
Date: 2022-09-13 06:26:16
Message-ID: CALj2ACWCk2it_A=ubKc6r3jt9N3WrfGsD7FWD_YCF2AYyraD+w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Sep 13, 2022 at 8:52 AM Noah Misch <noah(at)leadboat(dot)com> wrote:
>
> > > > [1] - https://www.postgresql.org/message-id/flat/20220909.172949.2223165886970819060.horikyota.ntt%40gmail.com
>
> I plan to use that message's patch, because it guarantees WALRCV_STOPPED at
> the code location being changed. Today, in the unlikely event of
> !WalRcvStreaming() due to WALRCV_WAITING or WALRCV_STOPPING, that code
> proceeds without waiting for WALRCV_STOPPED.

Hm. That was the original fix [2] proposed and it works. The concern
is that XLogShutdownWalRcv() does a bunch of work via ShutdownWalRcv()
- it calls ConditionVariablePrepareToSleep(),
ConditionVariableCancelSleep() (has lock 2 acquisitions and
requisitions) and 1 function call WalRcvRunning()) even for
WALRCV_STOPPED case, all this is unnecessary IMO when we determine the
walreceiver is state is already WALRCV_STOPPED. I think we can do
something like [3] to allay this concern and go with the fix proposed
at [1] unconditionally calling XLogShutdownWalRcv().

> If WALRCV_WAITING or WALRCV_STOPPING can happen at that patch's code site, I
> perhaps should back-patch the change to released versions. Does anyone know
> whether one or both can happen?

IMO, we must back-patch to the version where
cc2c7d65fc27e877c9f407587b0b92d46cd6dd16 got introduced irrespective
of any of the above happening.

Thoughts?

[1] - https://www.postgresql.org/message-id/flat/20220909.172949.2223165886970819060.horikyota.ntt%40gmail.com
[2] - https://www.postgresql.org/message-id/flat/CALj2ACUoBWbaFo_t0gew%2Bx6n0V%2BmpvB_23HLvsVD9abgCShV5A%40mail.gmail.com#e762ee94cbbe32a0b8c72c60793626b3
[3] -diff --git a/src/backend/replication/walreceiverfuncs.c
b/src/backend/replication/walreceiverfuncs.c
index 90798b9d53..3487793c7a 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -181,6 +181,7 @@ ShutdownWalRcv(void)
WalRcvData *walrcv = WalRcv;
pid_t walrcvpid = 0;
bool stopped = false;
+ bool was_stopped = false;

/*
* Request walreceiver to stop. Walreceiver will switch to
WALRCV_STOPPED
@@ -191,6 +192,7 @@ ShutdownWalRcv(void)
switch (walrcv->walRcvState)
{
case WALRCV_STOPPED:
+ was_stopped = true;
break;
case WALRCV_STARTING:
walrcv->walRcvState = WALRCV_STOPPED;
@@ -208,6 +210,10 @@ ShutdownWalRcv(void)
}
SpinLockRelease(&walrcv->mutex);

+ /* Quick exit if walreceiver was already stopped. */
+ if (was_stopped)
+ return;
+
/* Unnecessary but consistent. */
if (stopped)
ConditionVariableBroadcast(&walrcv->walRcvStoppedCV);

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Eisentraut 2022-09-13 06:32:43 Re: postgres_fdw hint messages
Previous Message Kyotaro Horiguchi 2022-09-13 06:22:37 Re: Error "initial slot snapshot too large" in create replication slot