Re: sync_standbys_defined read/write race on startup

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: "Maksim(dot)Melnikov" <m(dot)melnikov(at)postgrespro(dot)ru>
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: sync_standbys_defined read/write race on startup
Date: 2025-03-26 22:13:08
Message-ID: Z-R79EJVYfRy-EpW@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 26, 2025 at 05:17:17PM +0300, Maksim.Melnikov wrote:
> Test check visibility of prepared transactions in standby after a restart
> while primary is down.
> Failed assert check that changes, commited on transaction on primary node,
> were synchronously replicated to standby node.
> In test flow we have primary server stop/start events.
> Assert failure is hard to reproduce, but I found how to do it synthetically,
> can do it by injecting in code synthetical sleep call
>
> --- a/src/backend/replication/syncrep.c
> +++ b/src/backend/replication/syncrep.c
> @@ -924,6 +924,7 @@ SyncRepUpdateSyncStandbysDefined(void)
>
>         if (sync_standbys_defined != WalSndCtl->sync_standbys_defined)
>         {
> +               sleep(2);
>                 LWLockAcquire(SyncRepLock, LW_EXCLUSIVE);
>
>                 /*

Nice investigation to find out that this specific path faces a race
here. Tests on Windows are usually slower and good at catching these.
It may be possible that the buildfarm has reported that, actually.

> So my question is
> can we accept such behavior as bug of postgres codebase, or, on other way,
> it is bug of unstable test?
> P.S.
> For me it is looking strange, that postgres start processing workload before
> full config initializing.
> Attached possible patch for codebase.

Yeah, I agree that it is a bug, and your approach to be more careful
in SyncRepWaitForLSN() so as we wait until the configuration is loaded
and set from the checkpointer sounds like the sensible thing to do,
because the backends have no idea yet what they should do as the
configuration is not loaded.

Tracking if the configuration has been properly loaded in
WalSndCtlData makes sense here, but I am confused by the patch: you
have defined SYNCSTANDBYSDEFINED but sync_standbys_defined never sets
it. It may be simpler to use a separate boolean flag for this
purpose. WalSndCtlData is a private structure holding the state of
the WAL senders internally, so ABI does not stress me much here
(flexible array at the end of the structure, as well..).

Also mentioned by Kirill downthread: let's add a regression test with
an injection point that does a wait. This race condition is worth
having a test for. You could define the point of where you have added
your hardcoded sleep(), for example.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Michael Paquier 2025-03-26 22:15:19 Re: Fix infinite loop from setting scram_iterations to INT_MAX
Previous Message Rafael Thofehrn Castro 2025-03-26 21:57:54 Re: Proposal: Progressive explain