Unable to start replica after failover

From: "Lahnov, Igor" <Igor(dot)Lahnov(at)nexign(dot)com>
To: "pgsql-general(at)lists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Cc: "Timonin, Alexey" <Alexey(dot)Timonin(at)nexign(dot)com>, "Dmitriyev, Andrey" <Andrey(dot)Dmitriyev(at)nexign(dot)com>, "Vassiliev, Andrey" <Andrey(dot)Vassiliev(at)nexign(dot)com>
Subject: Unable to start replica after failover
Date: 2022-07-29 15:01:44
Message-ID: c0085e0f801d4bda9e50b46c496311b3@nexign.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi,
We're using asynchronous streaming replication together with WAL archiving and recovery (we've set these parameters: archive_command, restore_command, recovery_target_timeline=latest).
To manage backup and recovery, we're using pg_probackup.
We have a 3-node cluster managed by Patroni on Raft.
While running our cluster, we encounter a recurring problem: after a failover, the former leader cannot recover neither via the tools provided by PostgreSQL nor with the Patroni tools (pg_rewind).
We've seen a number of different WAL reading errors. For example:
* "could not find previous WAL record at E6F/C2436F50: invalid resource manager ID 139 at E6F/C2436F50"; or
* "could not find previous WAL record at 54E/FB348118: unexpected pageaddr 54E/7B34A000 in log segment 000000050000054E000000FB, offset 3448832".
There have been some other variations but all such errors have the same cause:
1. At startup, the former leader does not immediately get access to the backup. As a result, the recovery process uses the local WAL, which inevitably leads to a split brain since in most cases there is a lag in replication after a failover.
2. Next, the former leader finally manages to read the latest log from the backup, and that log has a higher priority than the local log, even though the recovery with the local log has already occurred.
As a result, we see the following:
"new timeline 3 forked off current database system timeline 2 before current recovery point 54E/FB348150".
When pg_rewind is run, it also uses the log from the backup (the lagging log from the new leader) instead of the partial log with which the former leader has already been started.
Next, pg_rewind returns errors while reading the log from the backup back, looking for the last checkpoint, which is quite reasonable because, once a new leader starts, the point of divergence normally ends up in the next timeline and the previous timeline's backup log does not have a block with the LSN of the divergence.
Also, in this case, it is impossible to build a map of the pages that have been changed to roll back the changes using pg_rewind.

To avoid the problem, we decided to stop using restore_command. Could you please let us know if there is a better solution to the problem we've described?

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Bryn Llewellyn 2022-07-29 17:10:46 Re: « The PL/pgSQL interpreter parses the function's source text and produces an internal binary instruction tree... »
Previous Message Gianni Ceccarelli 2022-07-29 09:05:18 Re: Syntax error when combining --set and --command has me stumped