Re: BUG #17744: Fail Assert while recoverying from pg_basebackup

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: thomas(dot)munro(at)gmail(dot)com, andres(at)anarazel(dot)de, zxwsbg12138(at)gmail(dot)com, pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: Re: BUG #17744: Fail Assert while recoverying from pg_basebackup
Date: 2023-02-27 04:01:21
Message-ID: Y/wrERSMTIdCuy9N@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Mon, Feb 27, 2023 at 12:01:01PM +0900, Kyotaro Horiguchi wrote:
> In my understanding, the approach differs slightly from the OP's
> case. While server reaches consistency state early on using this
> steps, in the OP's case it takes much longer time to reach
> consistency, to the point where a checkpoint occurs before it is
> reached. It seems that the approach taken by the patch does not work
> in the OP's case, where (I guess that) hot-standby is disabled.

Yep, hot standby is disabled in this case. Now note that we go
through the hot-standby initialization based on
(ArchiveRecoveryRequested && EnableHotStandby). The first one is true
when we find a recovery.signal or standby.signal. The second one is
the hot_standby GUC, hence recovery.signal enforces the initialization
even if we know that we won't allow standby connections anyway. This
is a waste, surely, if you only have recovery.signal, still it does
not hurt?

> As you know, pg_basebackup has the -R option. It seems to me that the
> existence means we conciously allowed the backup to start without
> archive recovery. The OP's operation (if it is correct) involves
> taking a base backup and treating it as a literal backup, meaning it
> begins its life on the origial timeline. This change prevents an
> operation that we may have implicitly allowed (at least, that is my
> understanding).

Note that pg_basebackup -R enforces a standby.signal, not a
recovery.signal. I am questioning whether we have to support the case
of this thread, to be honest, particularly if it means simplifying the
recovery logic that's now behind the two signal files and the
backup_label. That's not a behavior that should be enforced in a
stable branch, perhaps, but if it means removing some of the boolean
flags controlling the recovery behavior, simplifying its logic, then I
see a pretty good argument with doing a move in this direction if it
eases the maintenance of this code with less cases to think about. We
have too many of them already and these have caused us many issues in
recent history (continuation records, for one), so I would draw a
clear line and simplify things.

Also, note "Recovering Using a Continuous Archive Backup", where we
basically say that one *has* to put a recovery.signal when recovering
using a base backup in one of the steps, which enforces a new timeline
once recovery is done with a base backup. What I have sent upthread
enforces that more in the presence of a backup_label file without
recovery.signal, which may be too aggressive, perhaps, because it
causes always a TLI jump, still it avoids the need of a hard failure
to always require a recovery.signal or standby.signal if we are
recovering from a base backup. If we were to support the case where
we have (InArchiveRecovery && !ArchiveRecoveryRequested) during
recovery, we'd better put a clear definition behind it. Personnally,
I don't know what it means based on what is documented in
xlogrecovery.c or xlog.c (echoing with [1]).

> Thus, I slightly object to let backup_label to enforce archive
> recovery. And rather prefer to suppress checkpoints before reaching
> consistency(?).

Suppressing restartpoints is not enough to completely close the issue.
With the case I am mentioning above, it is possible to run
restartpoints after consistency has been reached without the hot
standby initialization done, causing the same failure when attempting
to truncate pg_subtrans/ in a restart point :/

[1]: https://www.postgresql.org/message-id/20230201153252.l6kcfum7trdovw2b@alap3.anarazel.de
--
Michael

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Andrey Lepikhov 2023-02-27 06:23:57 Re: Clause accidentally pushed down ( Possible bug in Making Vars outer-join aware)
Previous Message Kyotaro Horiguchi 2023-02-27 03:01:01 Re: BUG #17744: Fail Assert while recoverying from pg_basebackup