Allow users to choose what happens when recovery target is not reached

From: Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
To: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Allow users to choose what happens when recovery target is not reached
Date: 2021-11-12 10:14:00
Message-ID: CALj2ACWR4iaph7AWCr5-V9dXqpf2p5B=3fTyvLfL8VD_E+x0tA@mail.gmail.com
Lists: pgsql-hackers

Hi,

Currently, the server shuts down with a FATAL error (added by commit
[1]) when the recovery target isn't reached. This can cause a server
availability problem, especially in disaster recovery (geo restores),
where the primary is down and the user is doing a PITR on a server in
another region that never received the last few WAL files required to
reach the recovery target. In this case, users might prefer an
available server to no server at all. Since commit [1], there's no way
to get that behavior.

There can be many reasons for the last few WAL files not reaching the
target server where the user is performing the PITR: the primary may
have gone down before archiving the last few WAL files, the archive
command may have failed, there may be network latency between the
primary and the archive location or between the archive location and
the target server, the restore command on the target server may have
failed, or the user may have chosen a wrong or future recovery target.
When the PITR fails with a FATAL error, we can ask users to restart the
server, but imagine the waste of compute resources: if there is 1 TB of
WAL to replay and only the last 16MB WAL file is missing, everything
has to be replayed from the beginning.

Here's a proposal (and a patch) to add a GUC that lets users choose
between emitting a warning and promoting, or shutting down with a FATAL
error (the default), when the recovery target isn't reached. Users who
need strict consistency can keep the FATAL shutdown; users who prefer
availability can choose promotion. There is some discussion around this
idea in [2].
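
To make the proposed behavior concrete, here is a minimal standalone
sketch of the decision at the end of recovery. It is not code from the
patch: the enum names, the function name decide_end_of_recovery() and
its parameters are hypothetical and only model the two choices described
above, with the FATAL shutdown as the default.

```c
#include <stdbool.h>

/* Hypothetical values for the proposed GUC; names are illustrative only. */
typedef enum TargetNotReachedAction
{
    TARGET_NOT_REACHED_SHUTDOWN,    /* default: FATAL error, server shuts down */
    TARGET_NOT_REACHED_PROMOTE      /* emit a WARNING, then promote */
} TargetNotReachedAction;

/* Hypothetical outcomes of the end-of-recovery decision. */
typedef enum RecoveryOutcome
{
    OUTCOME_NORMAL,                 /* no target set, or target reached */
    OUTCOME_PROMOTE_WITH_WARNING,   /* target missed, availability preferred */
    OUTCOME_FATAL_SHUTDOWN          /* target missed, consistency preferred */
} RecoveryOutcome;

/*
 * Decide what to do once WAL replay has run out of WAL.
 * The GUC only matters when a target was configured and not reached.
 */
static RecoveryOutcome
decide_end_of_recovery(bool target_configured,
                       bool target_reached,
                       TargetNotReachedAction action)
{
    if (!target_configured || target_reached)
        return OUTCOME_NORMAL;

    if (action == TARGET_NOT_REACHED_PROMOTE)
        return OUTCOME_PROMOTE_WITH_WARNING;

    return OUTCOME_FATAL_SHUTDOWN;
}
```

With the default setting, a missed target still ends in the FATAL
shutdown introduced by [1]; only an explicit opt-in changes that to a
warning followed by promotion.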

Thoughts?

[1] - commit dc788668bb269b10a108e87d14fefd1b9301b793
Author: Peter Eisentraut <peter(at)eisentraut(dot)org>
Date: Wed Jan 29 15:43:32 2020 +0100

Fail if recovery target is not reached

Before, if a recovery target is configured, but the archive ended
before the target was reached, recovery would end and the server would
promote without further notice. That was deemed to be pretty wrong.
With this change, if the recovery target is not reached, it is a fatal
error.

Based-on-patch-by: Leif Gunnar Erlandsen <leif(at)lako(dot)no>
Reviewed-by: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Discussion:
https://www.postgresql.org/message-id/flat/993736dd3f1713ec1f63fc3b653839f5(at)lako(dot)no

[2] - https://www.postgresql.org/message-id/b334d61396e6b0657a63dc38e16d429703fe9b96.camel%40j-davis.com

Regards,
Bharath Rupireddy.

Attachment Content-Type Size
v1-0001-Allow-users-to-choose-what-happens-when-recovery-.patch application/octet-stream 10.6 KB
