Re: Add recovery to pg_control and remove backup_label

From: Stephen Frost <sfrost(at)snowman(dot)net>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: David Steele <david(at)pgmasters(dot)net>, Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>, Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Add recovery to pg_control and remove backup_label
Date: 2023-11-28 15:42:50
Message-ID: ZWYKeqWykdwBMUar@tamriel.snowman.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Greetings,

* Robert Haas (robertmhaas(at)gmail(dot)com) wrote:
> On Sun, Nov 26, 2023 at 3:42 AM Stephen Frost <sfrost(at)snowman(dot)net> wrote:
> > What would really be helpful would be hearing from these individuals
> > directly as to what the issues are with the changes, such that perhaps
> > we can do things better in the future to avoid whatever the issue is
> > they're having with the changes. Simply saying we shouldn't make
> > changes in this area isn't workable and the constant push-back is
> > actively discouraging to folks trying to make improvements. Obviously
> > it's a biased view, but we've not had issues making the necessary
> > adjustments in pgbackrest with each release and I feel like if the
> > authors of wal-g or barman did that they would have spoken up.
>
> I'm happy if people show up to comment on proposed changes, but I
> think you're being a little bit unrealistic here. I have had to help
> plenty of people who have screwed up their backups in one way or
> another, generally by using some home-grown script, sometimes by
> misusing some existing backup tool. Those people are EDB customers;
> they don't read and participate in discussions here. If they did,
> perhaps they wouldn't be paying EDB to have me and my colleagues sort
> things out for them when it all goes wrong. I'm not trying to say that
> EDB doesn't have customers who participate in mailing list
> discussions, because we do, but it's a small minority, and I don't
> think that should surprise anyone. Moreover, the people who don't
> wouldn't necessarily have the background, expertise, or *time* to
> assess specific proposals in detail. If your point is that my
> perspective on what's helpful or unhelpful is not valid because I've
> only helped 30 people who had problems in this area, but that the
> perspective of those 30 people who were helped would be more valid,
> well, I don't agree with that. I think your perspective and David's
> are valid precisely *because* you've worked a lot on pgbackrest and no
> doubt interacted with lots of users; I think Andres's perspective is
> valid precisely *because* of his experience working with the fleet at
> Microsoft and individual customers at EDB and 2Q before that; and I
> think my perspective is valid for the same kinds of reasons.

I didn't mean to imply that anyone's perspective wasn't valid. I was
simply trying to get at the root question of: what *is* the issue with
the changes that are being made? If the answer to that is: we made
this change, which was hard for folks to deal with, and could have
been avoided by doing X, then I really, really want to hear what X
was! If the answer is, well, the changes weren't hard, but we didn't
like having to make any changes at all ... then I just don't have any
sympathy for that. People who write backup software for PG, be it
pgbackrest authors, wal-g authors, or homegrown script authors, will
need to adapt between major versions as we discover things that are
broken (such as exclusive mode, and such as the clear risk that's been
demonstrated of a torn copy of pg_control getting copied, resulting in
a completely invalid backup) and fix them.

> I am more in agreement with the idea that it would be nice to hear
> from backup tool authors, but I think even that has limited value.
> Surely we can all agree that if the backup tool is correctly written,
> none of this matters, because you'll make the tool do the right things
> and then you'll be fine. The difficulty here, and the motivation
> behind this proposal and others like it, is that too many users fail
> to follow the procedure correctly. If we hear from the authors of
> well-written backup tools, I expect they will tell us they can adapt
> their tool to whatever we do. And if we hear from the authors of
> poorly-written tools, well, I don't think their opinions would form a
> great basis for making decisions.

Uhhh. No, I disagree with this- I'd argue that pgbackrest was broken
until the most recently releases where we implemented a check to ensure
that the pg_control we copy has a valid PG CRC. Did we know it was
broken before this discussion? No, but that doesn't change the fact
that we certainly could have ended up copying an invalid pg_control and
thus have an invalid backup, which even our 'pgbackrest verify' wouldn't
have caught because that just checks that the checksum that pgbackrest
calculates for every file hasn't changed since we copied it- but that
didn't do anything for the issue about pg_control having an invalid
internal checksum due to a torn write when we copied it.

So, yes, it does matter. We didn't make pgbackrest do the right thing
in this case because we thought it was true that you couldn't get a torn
read of pg_control; Thomas showed that wasn't true and that puts all of
our users at risk. Thankfully somewhat minimal since we always copy
pg_control from the primary ... but still, it's not right, and we've
now taken steps to address it. Unfortunately, other tools are going to
have a more difficult time because they're not written in C, but we
still care about them, and that's why we're pushing for this change- to
allow them to get a pretty much guaranteed valid pg_control from PG to
store without having to figure out how to validate it themselves.

> > [ lengthy discussion of tools that don't work any more ]
>
> What confuses me here is that you seem to be arguing that we should
> *once again* make a breaking change to the backup API, but at the same
> time you're acknowledging that there are plenty of tools out there on
> the Internet that have gotten broken by previous rounds of changes.

The broken ones aren't being maintained. Yes, I'm happy to have those
explicitly and clearly broken. I don't want people using outdated,
broken, and unmaintained tools to backup their PG databases.

> It's only one step from there to conclude that whacking the API around
> does more harm than good, but you seem to reject that conclusion.

We change the API because it objectively, clearly, addresses real issues
that users can run into that will cause them to have invalid backups if
left the way it is. That backup software authors need to adjust to this
isn't a bad thing- it's a good thing, because we're fixing things and
they should be thrilled to have these issues addressed that they may not
have even considered.

> Personally, I haven't yet seen any evidence that the removal of
> exclusive backup mode made any real difference one way or the other. I
> think I've heard about people needing to adjust code for it, but not
> about that being a problem. I have yet to run into anyone who was
> previously using it but, because it was deprecated, switched to doing
> something better and safer. Have you?

I'm glad that people haven't had a problem adjusting their code to the
removal of exclusive backup mode, that's good, and leaves me, again, a
bit confused at what the issue here is about changing things- apparently
people don't actually have a problem with it, yet it keeps getting
raised as an issue every time we change things in this area. I don't
understand that.

I'm not following the question entirely, I don't think. Most backup
tool authors actively changed to using non-exclusive backup when
exclusive backup mode was deprecated, certainly pgbackrest did and we've
been using non-exclusive backup mode since it was available. Are you
saying that, because everyone moved off of it, we should have kept it?
In that case the answer is clearly no- omnipitr, at the least, didn't
update to non-exclusive and therefore continued to run with the risk
that a crash during a backup would result in a cluster that wouldn't
start without manual intervention (an issue I've definitely heard about
a number of times, even recently) and that manual intervention (remove
the backup_label file) actively results in a *corrupt* cluster if the
user is actually restoring from a backup, which makes it really terrible
direction to give someone. Here, use this hack- but only if you're 100%
coming back from a crash and absolutely never, ever, ever if you're
actually restoring from a backup.

Thanks!

Stephen

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Tristan Partin 2023-11-28 15:44:18 Re: SSL tests fail on OpenSSL v3.2.0
Previous Message Tom Lane 2023-11-28 15:42:27 Re: SSL tests fail on OpenSSL v3.2.0