Re: post-freeze damage control

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>
Cc: Michael Paquier <michael(at)paquier(dot)xyz>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: post-freeze damage control
Date: 2024-04-09 12:46:46
Message-ID: CA+TgmoZ8ZJPHmt67ijq7fWc=27xE1tfsakRwd+3SGjzAW0FpuA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Apr 9, 2024 at 7:24 AM Tomas Vondra
<tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> I think it's a bit more nuanced, because it's about backups/restore. The
> bug might be subtle, and you won't learn about it until the moment when
> you need to restore (or perhaps even long after that). At which point
> "You might have taken the backup in some other way." is not really a
> viable exit route.
>
> Anyway, I'm still not worried about this particular feature, and I'll
> keep doing the stress testing.

In all sincerity, I appreciate the endorsement. Basically what's been
scaring me about this feature is the possibility that there's some
incurable design flaw that I've managed to completely miss. If it has
some more garden-variety bugs, that's still pretty bad: people will
potentially lose data and be unable to get it back. But, as long as
we're able to find the bugs and fix them, the situation should improve
over time until, hopefully, everybody trusts it roughly as much as we
trust, say, crash recovery. Perhaps even a bit more: I think this code
is much better-written than our crash recovery code, which has grown
into a giant snarl that nobody seems able to untangle, despite
multiple refactoring attempts. However, if there's some reason why the
approach is fundamentally unsound which I and others have failed to
detect, then we're at risk of shipping a feature that is irretrievably
broken. That would really suck.

I'm fairly hopeful that there is no such design defect: I certainly
can't think of one. But, it's much easier to imagine an incurable
problem here than with, say, the recent pruning+freezing changes.
Those changes might have bugs, and those bugs might be hard to find,
but if they do exist and are found, they can be fixed. Here, it's a
little less obvious that that's true. We're relying on our ability, at
incremental backup time, to sort out from the manifest and the WAL
summaries, what needs to be included in the backup in order for a
subsequent pg_combinebackup operation to produce correct results. The
basic idea is simple enough, but the details are complicated, and it
feels like a subtle defect in the algorithm could potentially scuttle
the whole thing. I'd certainly appreciate having more smart people try
to think of things that I might have overlooked.

--
Robert Haas
EDB: http://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Zhijie Hou (Fujitsu) 2024-04-09 13:00:49 RE: Synchronizing slots from primary to standby
Previous Message Joe Conway 2024-04-09 12:43:57 Re: Security lessons from liblzma