Re: speed up a logical replica setup

From: "Euler Taveira" <euler(at)eulerto(dot)com>
To: "kuroda(dot)hayato(at)fujitsu(dot)com" <kuroda(dot)hayato(at)fujitsu(dot)com>, "Peter Eisentraut" <peter(at)eisentraut(dot)org>, "Bharath Rupireddy" <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Cc: "Amit Kapila" <amit(dot)kapila16(at)gmail(dot)com>, "Shlok Kyal" <shlok(dot)kyal(dot)oss(at)gmail(dot)com>, "Tomas Vondra" <tomas(dot)vondra(at)enterprisedb(dot)com>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>, "Michael Paquier" <michael(at)paquier(dot)xyz>, "Andres Freund" <andres(at)anarazel(dot)de>, "Ashutosh Bapat" <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Fabrízio de Royes Mello <fabriziomello(at)gmail(dot)com>, "vignesh C" <vignesh21(at)gmail(dot)com>
Subject: Re: speed up a logical replica setup
Date: 2024-03-26 02:53:21
Message-ID: c8e92bcb-69c3-477e-93d4-6f39e030cab0@app.fastmail.com
Lists: pgsql-hackers

On Mon, Mar 25, 2024, at 1:06 PM, Hayato Kuroda (Fujitsu) wrote:
> ## Analysis for failure 1
>
> The failure is caused by a time lag between the walreceiver finishing and
> pg_is_in_recovery() returning false.
>
> According to the output [1], it seems that the tool failed at wait_for_end_recovery()
> with the message "standby server disconnected from the primary". Also, lines
> "redo done at..." and "terminating walreceiver process due to administrator command"
> meant that walreceiver was requested to shut down by XLogShutdownWalRcv().
>
> According to the source, the walreceiver is shut down in
> StartupXLOG()->FinishWalRecovery()->XLogShutdownWalRcv(). SharedRecoveryState
> is changed to RECOVERY_STATE_DONE (at which point pg_is_in_recovery() returns
> false) in the latter part of StartupXLOG().
>
> So, if there is a delay between FinishWalRecovery() and the state change, the check
> in wait_for_end_recovery() fails during that window. Since the walreceiver is allowed
> to be missing 10 times and is checked once per second, the failure occurs if
> the time lag is longer than 10 seconds.
>
> I do not have a good way to fix it. One approach is to make NUM_CONN_ATTEMPTS larger,
> but that is not a fundamental solution.

I was expecting that slow hosts might have issues in wait_for_end_recovery().
As you said, there are a lot of steps between FinishWalRecovery() (where the
walreceiver is shut down -- XLogShutdownWalRcv) and the point where
SharedRecoveryState is set to RECOVERY_STATE_DONE. If this window takes longer
than NUM_CONN_ATTEMPTS * WAIT_INTERVAL (10 seconds), the tool aborts the
execution. That's a bad decision because the server has already finished the
promotion and is just doing the final preparation to become a primary.
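
The timing can be illustrated with a small standalone sketch that replays the
counter logic against a list of once-per-second observations. It is not the
tool's code: first_abort_poll() and the boolean array are hypothetical names
introduced here, and no server is contacted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_CONN_ATTEMPTS 10	/* value discussed in this thread */

/*
 * Hypothetical helper: walreceiver_seen[i] is true when poll i found a row
 * in pg_stat_wal_receiver. Returns the index of the poll at which the loop
 * would give up, or -1 if it never gives up.
 */
static int
first_abort_poll(const bool *walreceiver_seen, size_t npolls)
{
    int     count = 0;

    assert(walreceiver_seen != NULL || npolls == 0);

    for (size_t i = 0; i < npolls; i++)
    {
        if (!walreceiver_seen[i])
        {
            /* same test as the check under discussion */
            if (++count > NUM_CONN_ATTEMPTS)
                return (int) i;
        }
        else
            count = 0;			/* reset counter if it connects again */
    }

    return -1;
}
```

With one poll per second, ten consecutive misses are tolerated and the
eleventh aborts, so a promotion that needs more than about 10 seconds after
the walreceiver exits is reported as a failure.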

    /*
     * If it is still in recovery, make sure the target server is
     * connected to the primary so it can receive the required WAL to
     * finish the recovery process. If it is disconnected try
     * NUM_CONN_ATTEMPTS in a row and bail out if not succeed.
     */
    res = PQexec(conn,
                 "SELECT 1 FROM pg_catalog.pg_stat_wal_receiver");
    if (PQntuples(res) == 0)
    {
        if (++count > NUM_CONN_ATTEMPTS)
        {
            stop_standby_server(subscriber_dir);
            pg_log_error("standby server disconnected from the primary");
            break;
        }
    }
    else
        count = 0;      /* reset counter if it connects again */

This code was added to defend against the death/crash of the target server. There
are at least 3 options:

(1) increase NUM_CONN_ATTEMPTS * WAIT_INTERVAL. We discussed this constant
and I decided to use 10 seconds because even on some slow hosts, this time
wasn't reached during my tests. It seems I forgot to test the combination of slow
host, asserts enabled, and ubsan. I didn't notice that pg_promote() uses 60
seconds as its default wait. Maybe that's a reasonable value. I checked the
004_timeline_switch test and the last run took: 39.2s (serinus), 33.1s
(culicidae), 18.31s (calliphoridae) and 27.52s (olingo).

(2) check whether the primary is running when the walreceiver is not available
on the target server. Increment the attempt counter only if the primary is not
running. Hence, the described case doesn't increment the count variable.

(3) set the recovery_timeout default to != 0 and remove the pg_stat_wal_receiver
check that protects against the death/crash of the target server. I explained in
a previous message that a timeout may occur when the WAL replay needed to reach
a consistent state takes more than recovery-timeout seconds.

Option (1) is the easiest fix; however, we can hit the same issue again if a
slow host gets even slower, and then we have to adjust this value again.
Option (2) interprets the walreceiver's absence as the end of recovery: if the
primary server is running, it can indicate that the target server is about to
finish recovery. Option (3) is not as resilient as the other options.

The first patch implements a combination of (1) and (2).
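
As a rough illustration of how (1) and (2) can combine, here is a standalone
sketch. It does not reproduce the attached patch: PollResult,
first_abort_poll_checked() and the max_attempts parameter are hypothetical,
and leaving the counter unchanged while the primary is still running is an
assumption made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of what one poll observed. */
typedef struct PollResult
{
    bool    walreceiver_seen;	/* row found in pg_stat_wal_receiver? */
    bool    primary_running;	/* could the primary be reached? */
} PollResult;

/*
 * Returns the index of the poll at which the loop would give up, or -1 if
 * it never gives up. A missing walreceiver only counts as a failed attempt
 * when the primary is not running either (option 2); max_attempts stands in
 * for the enlarged limit (option 1).
 */
static int
first_abort_poll_checked(const PollResult *polls, size_t npolls,
                         int max_attempts)
{
    int     count = 0;

    assert(max_attempts > 0);
    assert(polls != NULL || npolls == 0);

    for (size_t i = 0; i < npolls; i++)
    {
        if (polls[i].walreceiver_seen)
            count = 0;			/* connected again: start over */
        else if (!polls[i].primary_running)
        {
            if (++count > max_attempts)
                return (int) i;
        }
        /* walreceiver gone but primary up: assumed to be end of recovery */
    }

    return -1;
}
```

Under these assumptions a slow promotion no longer consumes attempts, while a
target that lost a dead primary still bails out after max_attempts polls.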

> ## Analysis for failure 2
>
> According to [2], the physical replication slot which is specified as primary_slot_name
> was not used by the walsender process. At that time the walsender did not exist yet.
>
> ```
> ...
> pg_createsubscriber: publisher: current wal senders: 0
> pg_createsubscriber: command is: SELECT 1 FROM pg_catalog.pg_replication_slots WHERE active AND slot_name = 'physical_slot'
> pg_createsubscriber: error: could not obtain replication slot information: got 0 rows, expected 1 row
> ...
> ```
>
> Currently the standby must be stopped before the command is run, and the current
> code does not block to ensure that replication has started. So there is a
> possibility that the check runs before the walsender is launched.
>
> One possible approach is to wait until the replication starts. An alternative is
> to ease the condition.

That's my suggestion too. I reused NUM_CONN_ATTEMPTS (that was renamed to
NUM_ATTEMPTS in the first patch). See second patch.
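
The waiting approach amounts to a bounded retry loop. The sketch below is only
an illustration, not the second patch: slot_probe_fn, wait_for_active_slot()
and probe_active_after() are hypothetical, and the probe stands in for running
the pg_replication_slots query quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback: returns true once the slot is reported active. */
typedef bool (*slot_probe_fn) (void *arg);

/*
 * Probe up to max_attempts times and stop at the first success. The number
 * of probes performed is stored in *attempts_used when it is not NULL.
 */
static bool
wait_for_active_slot(slot_probe_fn probe, void *arg, int max_attempts,
                     int *attempts_used)
{
    assert(probe != NULL && max_attempts > 0);

    for (int attempt = 1; attempt <= max_attempts; attempt++)
    {
        if (probe(arg))
        {
            if (attempts_used != NULL)
                *attempts_used = attempt;
            return true;
        }
        /* a real tool would sleep for one wait interval here */
    }

    if (attempts_used != NULL)
        *attempts_used = max_attempts;
    return false;
}

/* Demo probe: reports "inactive" until the counter in *arg reaches zero. */
static bool
probe_active_after(void *arg)
{
    int    *remaining = (int *) arg;

    if (*remaining > 0)
    {
        (*remaining)--;
        return false;
    }
    return true;
}
```

For example, with a probe that fails three times, the loop succeeds on the
fourth attempt; if the slot never becomes active within max_attempts probes,
the caller gets false and can report the error as before.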

--
Euler Taveira
EDB https://www.enterprisedb.com/

Attachment                                                     Content-Type  Size
v1-0001-Improve-the-code-that-checks-if-the-recovery-is-f.patch  text/x-patch  4.3 KB
v1-0002-Improve-the-code-that-checks-if-the-primary-slot-.patch  text/x-patch  2.4 KB
