Re: Incremental backup from a streaming replication standby fails

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
Cc: pgsql-hackers(at)postgresql(dot)org
Subject: Re: Incremental backup from a streaming replication standby fails
Date: 2024-07-19 14:52:45
Message-ID: CA+TgmobToSs9DNynYA-iQt9m5zVAnyDCbfwGwcmEOA5zecPk4w@mail.gmail.com
Lists: pgsql-hackers

On Mon, Jul 15, 2024 at 11:27 AM Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at> wrote:
> On Sat, 2024-06-29 at 07:01 +0200, Laurenz Albe wrote:
> > I played around with incremental backup yesterday and tried $subject
> >
> > The WAL summarizer is running on the standby server, but when I try
> > to take an incremental backup, I get an error that I understand to mean
> > that WAL summarizing hasn't caught up yet.
> >
> > I am not sure if that is working as designed, but if it is, I think it
> > should be documented.
>
> I played with this some more. Here is the exact error message:
>
> ERROR: manifest requires WAL from final timeline 1 ending at 0/1967C260, but this backup starts at 0/1967C190
>
> By trial and error I found that when I run a CHECKPOINT on the primary,
> taking an incremental backup on the standby works.
>
> I couldn't fathom the cause of that, but I think that that should either
> be addressed or documented before v17 comes out.

I had a feeling this was going to be confusing. I'm not sure what to
do about it, but I'm open to suggestions.

Suppose you take a full backup F; replay of that backup will begin
with a checkpoint CF. Then you try to take an incremental backup I;
replay will begin from a checkpoint CI. For the incremental backup to
be valid, it must include all blocks modified after CF and before CI.
But when the backup is taken on a standby, no new checkpoint is
possible. Hence, CI will be the most recent restartpoint on the
standby that has occurred before the backup starts. So, if F is taken
on the primary and then I is immediately taken on the standby without
the standby having done a new restartpoint, or if both F and I are
taken on the standby and no restartpoint intervenes, then CF=CI. In
that scenario, an incremental backup is pretty much pointless: every
single incremental file would contain 0 blocks. You might as well just
use the backup you already have, unless one of the non-relation files
has changed. So, except in that unusual corner case, the fact that the
backup fails isn't really costing you anything. In fact, there's a
decent chance that it's saving you from taking a completely useless
backup.

On the primary, this doesn't occur, because there, each new backup
triggers a new checkpoint, so you always have CI>CF.
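
To make the difference between the two cases concrete, here is a toy
model in Python. It is only a sketch under loud assumptions: LSNs are
plain integers, and the Server class, its fields, and its methods are
invented for illustration and do not correspond to anything in the
server code. The only behavior it encodes is what is described above:
a backup on the primary forces a new checkpoint, while a backup on a
standby reuses the most recent restartpoint.

```python
from dataclasses import dataclass


@dataclass
class Server:
    """Toy model of a server (hypothetical; LSNs are plain integers)."""
    is_standby: bool
    current_lsn: int      # how far WAL has been written or replayed
    last_checkpoint: int  # latest checkpoint (primary) or restartpoint (standby)

    def start_backup(self) -> int:
        """Return the checkpoint from which replay of this backup would begin."""
        if not self.is_standby:
            # Primary: each backup triggers a new checkpoint. In this toy
            # model, writing that checkpoint consumes one unit of WAL.
            self.current_lsn += 1
            self.last_checkpoint = self.current_lsn
        # Standby: no new checkpoint is possible, so the most recent
        # restartpoint is used unchanged.
        return self.last_checkpoint


primary = Server(is_standby=False, current_lsn=100, last_checkpoint=50)
primary_cf = primary.start_backup()  # full backup F
primary_ci = primary.start_backup()  # incremental backup I

standby = Server(is_standby=True, current_lsn=100, last_checkpoint=50)
standby_cf = standby.start_backup()  # full backup F
standby_ci = standby.start_backup()  # incremental I, no restartpoint in between

print("primary: CI > CF ?", primary_ci > primary_cf)
print("standby: CI == CF ?", standby_ci == standby_cf)
```

In this model the standby only gets CI>CF once last_checkpoint moves
forward, which is what a new restartpoint would do.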

The error message is definitely confusing. The reason I'm not sure how
to do better is that there is a large class of errors that a user
could make that would trigger an error of this general type. I'm
guessing that attempting a standby backup with CF=CI will turn out to
be the most common one, but I don't think it'll be the only one that
ever comes up. The code in PrepareForIncrementalBackup() focuses on
what has gone wrong on a technical level rather than on what you
probably did to create that situation. Indeed, the server doesn't
really know what you did to create that situation. You could trigger
the same error by taking a full backup on the primary and then trying
to take an incremental based on that full backup on a time-delayed
standby (or a lagging standby) whose replay position was behind the
primary, i.e. CI<CF.
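
As a rough illustration of the comparison behind the error message,
the sketch below parses the two LSNs from the reported error and
compares them. It is Python rather than the server's C, and the
function names are hypothetical; it assumes only that an LSN printed
as "hi/lo" is two hexadecimal halves of a 64-bit position, and that
the error means the WAL required by the prior backup's manifest ends
after the point where the new backup starts.

```python
def parse_lsn(text: str) -> int:
    """Convert an LSN in textual 'hi/lo' hex form to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def prior_backup_is_usable(manifest_end_lsn: str, backup_start_lsn: str) -> bool:
    """Hypothetical helper: True when the new backup starts at or after
    the point where the WAL required by the prior manifest ends."""
    return parse_lsn(manifest_end_lsn) <= parse_lsn(backup_start_lsn)


# Values taken from the error message quoted above.
manifest_end = "0/1967C260"
backup_start = "0/1967C190"

gap = parse_lsn(manifest_end) - parse_lsn(backup_start)
print("usable:", prior_backup_is_usable(manifest_end, backup_start))
print("backup starts", gap, "bytes before the manifest's WAL range ends")
```

A comparison like this can only say that the LSNs are in the wrong
order; it cannot tell apart the CF=CI case, the lagging-standby case,
or any other way of getting there.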

More perversely, you could trigger the error by spinning up a standby,
promoting it, taking a full backup, destroying the standby, removing
the timeline history file from the archive, spinning up a new standby,
promoting onto the same timeline ID as the previous one, and then
trying to take an incremental backup relative to the full backup. This
might actually succeed, if you take the incremental backup at a later
LSN than the previous full backup, but, as you may guess, terrible
things will happen to you if you try to use such a backup. (I hope you
will agree that this would be a self-inflicted injury; I can't see any
way of detecting such cases.) If the incremental backup LSN is earlier
than the previous full backup LSN, this error will trigger.

So, given all the above, what can we do here?

One option might be to add an errhint() to the message. I had trouble
thinking of something that was compact enough to be reasonable to
include and yet reasonably accurate and useful, but maybe we can
brainstorm and figure something out. Another option might be to add
more to the documentation, but it's all so complicated that I'm not
sure what to write. It feels hard to make something that is brief
enough to be worth including, accurate enough to help more than it
hurts, and understandable enough that people who run into this will be
able to make use of it.

I think I'm a little too close to this to really know what the best
thing to do is, so I'm happy to hear suggestions from you and others.

--
Robert Haas
EDB: http://www.enterprisedb.com
