Re: recovery starting when backup_label exists, but not recovery.signal

From: David Steele <david(at)pgmasters(dot)net>
To: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
Cc: PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: recovery starting when backup_label exists, but not recovery.signal
Date: 2019-09-27 18:01:11
Message-ID: c4909bdd-4a4d-31b7-c705-aabf3f1273e0@pgmasters.net
Lists: pgsql-hackers

On 9/27/19 4:34 AM, Fujii Masao wrote:
> On Fri, Sep 27, 2019 at 3:36 AM David Steele <david(at)pgmasters(dot)net> wrote:
>>
>> On 9/24/19 1:25 AM, Fujii Masao wrote:
>>>
>>> When backup_label exists, the startup process enters archive recovery mode
>>> even if recovery.signal file doesn't exist. In this case, the startup process
>>> tries to retrieve WAL files by using restore_command. Then, at the beginning
>>> of the archive recovery, the contents of backup_label are copied to pg_control
>>> and backup_label file is removed. This would be an intentional behavior.
>>
>>> But I think the problem is that, if the server shuts down during that
>>> archive recovery, the restart of the server may cause the recovery to fail
>>> because neither backup_label nor recovery.signal exist and the server
>>> doesn't enter an archive recovery mode. Is this intentional, too? Seems No.
>>>
>>> So the problematic scenario is;
>>>
>>> 1. the server starts with backup_label, but not recovery.signal.
>>> 2. the startup process enters an archive recovery mode because
>>> backup_label exists.
>>> 3. the contents of backup_label are copied to pg_control and
>>> backup_label is deleted.
>>
>> Do you mean deleted or renamed to backup_label.old?
>
> Sorry for the confusing wording..
> I meant the following code that renames backup_label to .old, in StartupXLOG().

Right, that makes sense.

>>
>> I assume you have a repro? Can you give more details?
>
> What I did is:
>
> 1. Start PostgreSQL server with WAL archiving enabled.
> 2. Take an online backup by using pg_basebackup, for example,
> $ pg_basebackup -D backup
> 3. Execute many write SQL statements to generate lots of WAL files. During that
> execution, perform CHECKPOINT to remove some WAL files from the pg_wal directory.
> You need to repeat these until you confirm that there are many WAL files
> that have already been removed from pg_wal but exist only in archive area.
> 4. Shutdown the server.
> 5. Remove PGDATA and restore it from backup.
> 6. Set up restore_command.
> 7. (Forget to put recovery.signal)
> That is, in this scenario, you want to recover the database up to
> the latest WAL records in the archive area. So you need to start archive
> recovery by setting restore_command and putting recovery.signal.
> But the problem happens when you forget to put recovery.signal.
> 8. Start PostgreSQL server.
> 9. Shutdown the server while it's restoring archived WAL files and replaying
> them. At this point, you will notice that the archive recovery starts
> even though recovery.signal doesn't exist. So even archived WAL files
> are successfully restored at this step.
> 10. Restart PostgreSQL server. Since neither backup_label nor recovery.signal
> exists, crash recovery starts and fails to restore the archived WAL files.
> So you fail to recover the database up to the latest WAL record
> in the archive directory. The recovery will finish at an early point.

Yes, I see it now. I did not have enough WAL to make it work before, as
I suspected.
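
For anyone following along, the start/restart sequence in steps 8-10 can be
sketched as a toy Python model. This is not PostgreSQL code: the function
name start_server and the set-of-file-names representation are made up for
illustration, and it only encodes the v12 behavior as described in this
thread (backup_label alone triggers archive recovery, and backup_label is
renamed to backup_label.old when recovery begins).

```python
def start_server(files):
    """Return (recovery mode, files left behind) for one server start.

    Toy model of the v12 behavior described in this thread. `files` is a set
    of file names in the data directory. Hypothetical helper, not PostgreSQL
    code.
    """
    files = set(files)
    signal_present = "recovery.signal" in files or "standby.signal" in files
    if "backup_label" in files:
        # backup_label alone is enough to enter archive recovery; it is then
        # renamed to backup_label.old, so the next start no longer sees it.
        files.remove("backup_label")
        files.add("backup_label.old")
        return "archive", files
    if signal_present:
        return "archive", files
    # Neither backup_label nor a signal file: plain crash recovery.
    return "crash", files


# Steps 8-10: start without recovery.signal, shut down mid-recovery, restart.
first_mode, after_first = start_server({"backup_label"})
second_mode, _ = start_server(after_first)
print(first_mode, second_mode)
```

With recovery.signal in place the second start would still pick archive
recovery, since the signal file is not removed in this model until recovery
completes.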

>>> One idea to fix this issue is to make the above step #3 remember that
>>> backup_label existed, in pg_control. Then we should make the subsequent
>>> recovery enter an archive recovery mode if pg_control indicates that
>>> even if neither backup_label nor recovery.signal exist. Thought?
>>
>> That seems pretty invasive to me at this stage. I'd like to reproduce
>> it and see if there are alternatives.
>>
>> Also, are you sure this is a new behavior?
>
> In v11 or before, if backup_label exists but not recovery.conf,
> the startup process doesn't enter an archive recovery mode. It starts
> crash recovery in that case. So the behavior is somewhat different
> between versions.

Agreed. Since recovery options can be used in the presence of
backup_label *or* recovery.signal (or standby.signal for that matter)
this does represent a change in behavior. And it doesn't appear to be a
beneficial change.
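
To make the difference concrete, here is a rough decision table in Python.
It is only a sketch of the behavior as described in this thread, not the
actual startup logic: recovery_mode and its parameters are hypothetical
names, and "recovery_requested" stands for recovery.conf (v11 or before) or
recovery.signal/standby.signal (v12).

```python
def recovery_mode(major_version, backup_label, recovery_requested):
    """Toy decision table for which recovery mode the startup process picks.

    Hypothetical helper, not PostgreSQL code. recovery_requested means
    recovery.conf exists (v11 or before) or recovery.signal / standby.signal
    exists (v12).
    """
    if major_version <= 11:
        # v11 or before: backup_label without recovery.conf means crash recovery.
        return "archive" if recovery_requested else "crash"
    # v12 as described in this thread: backup_label alone is enough.
    return "archive" if (backup_label or recovery_requested) else "crash"


# backup_label present, but no recovery.conf / recovery.signal.
for version in (11, 12):
    print(version, recovery_mode(version, backup_label=True, recovery_requested=False))
```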

Regards,
--
-David
david(at)pgmasters(dot)net
