Re: The danger of deleting backup_label

From: David Steele <david(at)pgmasters(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>, Stephen Frost <sfrost(at)snowman(dot)net>
Subject: Re: The danger of deleting backup_label
Date: 2023-10-16 15:45:07
Message-ID: 9dc67363-5200-7395-3ddb-b50171800bc6@pgmasters.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On 10/16/23 10:55, Robert Haas wrote:
> On Sat, Oct 14, 2023 at 11:33 AM David Steele <david(at)pgmasters(dot)net> wrote:
>> All of this is fixable in HEAD, but seems incredibly dangerous to back
>> patch. Even so, I have attached the patch in case somebody sees an
>> opportunity that I do not.
>
> I really do not think we should be even thinking about back-patching
> something like this. It's clearly not a bug fix, although I'm sure
> that someone can try to characterize it that way, if they want to make
> the well-worn argument that any behavior they don't like is a bug. But
> that's a pretty lame argument. Usage errors on the part of users are
> not bugs, even if we've coded the software in such a way as to make
> those errors more likely.

Hmmm, the reason to back patch this is that it would fix [1], which sure
looks like a problem to me even if it is not a "bug". We can certainly
require backup software to retry pg_control until the checksum is valid
but that seems like a pretty big ask, even considering how complicated
backup is.

I investigated this as a solution to [1] because it seemed like a better
solution that what we have in [1]. But once I got into the weeds it was
obvious that this wasn't going to be something we could back patch.

> I think what we ought to be talking about is whether a change like
> this is a good idea even in master. I don't think it's a terrible
> idea, but I'm also not sure that it's a good idea. The problem is that
> if you're doing the right thing with your backup_label, then this is
> unnecessary, and if you're doing the wrong thing, then why should you
> do the right thing about this?

First and foremost, this solves the problem of pg_control being torn
when read by the backup software. It can't be torn if it is not there.

There are also other advantages -- we can massage pg_control before
writing it back out. This already happens, but before that happens there
is a copy of the (prior) running pg_control there that can mess up the
process.

> I mean, admittedly you can't just
> ignore a fatal error, but I think people will just run pg_resetwal,
> which is even worse than starting from the wrong checkpoint.

If you start from the last checkpoint (which is what will generally be
stored in pg_control) then the effect is pretty similar.

> I feel
> like in cases where a customer I'm working with has a bad backup,
> their entire focus is on doing something to that backup to get a
> running system back, whatever it takes. It's already too late at that
> point to fix the backup procedure - they only have the backups they
> have. You could hope people would do test restores before disaster
> strikes, but people who are that prepared are probably running a real
> backup tool and will never have this problem in the first place.

Right now the user can remove backup_label and get a "successful"
restore and not realize that they have just corrupted their cluster.
This is independent of the backup/restore tool doing all the right things.

My goal here is to narrow the options to try and make it so there is
*one* valid procedure that will work. For this patch the idea is that
they *must* start Postgres to get a valid pg_control from the
backup_label. Any other action leads to a fatal error.

Note that the current patch is very WIP and does not actually do
everything I'm talking about here. I was just trying to see if it could
be used to solve the problem in [1]. It can't.

> Perhaps that's all too pessimistic. I don't know. Certainly, other
> people can have experiences that are different than mine. But I feel
> like I struggle to think of a case where this would have prevented a
> bad outcome, and that makes me wonder whether it's really a good idea
> to complicate the system.

I'm specifically addressing cases like those that came up (twice!) in
[2]. This is the main place I see people stumbling these days. If even
hackers can make this mistake then we should do better.

Regards,
-David

[1]
https://www.postgresql.org/message-id/20221123014224.xisi44byq3cf5psi%40awork3.anarazel.de
[2] [1]
https://www.postgresql.org/message-id/flat/CAM_vCudkSjr7NsNKSdjwtfAm9dbzepY6beZ5DP177POKy8%3D2aw%40mail.gmail.com#746e492bfcd2667635634f1477a61288

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Laurenz Albe 2023-10-16 15:48:43 Re: Requiring recovery.signal or standby.signal when recovering with a backup_label
Previous Message Tomas Vondra 2023-10-16 15:34:44 Re: index prefetching