Re: deferred writing of two-phase state files adds fragility

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: Andres Freund <andres(at)anarazel(dot)de>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: deferred writing of two-phase state files adds fragility
Date: 2024-12-05 16:21:12
Message-ID: CA+TgmoZXtYoybG2Rj5CAUe9hMBBPjx-qRKU8VDK8OU6vs0uEtw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Dec 4, 2024 at 6:36 PM Andres Freund <andres(at)anarazel(dot)de> wrote:
> Is 2PC really that special in that regard? If the WAL that contains the
> checkpoint record itself gets corrupted, you're also in a world of hurt, once
> you shut down? Or, to a slightly lower degree, if there's any corrupted
> record between the redo pointer and the checkpoint record. And that's
> obviously a lot more records than just 2PC COMMIT/RECORD, making the
> likelihood of some corruption higher.

Sure, that's true. I think my point is just that in a lot of cases
where the WAL gets corrupted, you can eventually move on from the
problem. Let's say some bad hardware or some annoying "security"
software decides to overwrite the most recent CHECKPOINT record. If
you go down at that point, you're sad, but if you don't, the server
will eventually write a new checkpoint record and then the old, bad
one doesn't really matter any more. If you have standbys you may need
to rebuild them and if you need logical decoding you may need to
recreate subscriptions or something, but since you didn't really end
up needing the bad WAL, the fact that it happened doesn't have to
cripple the system in any enduring sense.

> The only reason it seems somewhat special is that it can more easily be
> noticed while the server is running.

I think there are two things that make it special. The first is that
this is nearly the only case where the primary has a critical
dependency on the WAL in the absence of a crash. The second is that,
AFAICT, there's no reasonable recovery strategy.
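
To make the first point concrete, here is a toy model in Python. It is
only a sketch under stated assumptions, not PostgreSQL code: ToyServer,
Record, read_record and corrupt are hypothetical names invented for the
illustration, the "WAL" is a plain list, and a CRC over the payload
stands in for record validation. It assumes that a checkpoint has to
read each pending prepared transaction's record back from the WAL in
order to write its state file, so a damaged PREPARE record blocks every
later checkpoint, while a damaged old CHECKPOINT record is simply
superseded by the next one.

```python
import zlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Record:
    kind: str
    payload: bytes
    crc: int  # checksum computed when the record was written


def read_record(wal, lsn):
    """Return the record at lsn, failing if its payload no longer matches its CRC."""
    rec = wal[lsn]
    if zlib.crc32(rec.payload) != rec.crc:
        raise IOError(f"corrupt WAL record at LSN {lsn}")
    return rec


@dataclass
class ToyServer:
    wal: list = field(default_factory=list)
    pending_2pc: dict = field(default_factory=dict)  # gid -> LSN of PREPARE record
    state_files: dict = field(default_factory=dict)  # gid -> state data "on disk"
    last_checkpoint: Optional[int] = None

    def append(self, kind, payload):
        self.wal.append(Record(kind, payload, zlib.crc32(payload)))
        return len(self.wal) - 1

    def prepare(self, gid):
        # Assumption: the state data lives only in the WAL until a checkpoint.
        self.pending_2pc[gid] = self.append("PREPARE", gid.encode())

    def checkpoint(self):
        # The checkpoint must re-read every pending PREPARE record from the WAL.
        for gid, lsn in list(self.pending_2pc.items()):
            self.state_files[gid] = read_record(self.wal, lsn).payload
            del self.pending_2pc[gid]
        # A new checkpoint never reads the previous checkpoint record.
        self.last_checkpoint = self.append("CHECKPOINT", b"ckpt")
        return self.last_checkpoint


def corrupt(server, lsn):
    """Simulate something overwriting the record at lsn."""
    server.wal[lsn].payload = b"garbage"


if __name__ == "__main__":
    # Case 1: the latest CHECKPOINT record is damaged, but the server stays up.
    a = ToyServer()
    corrupt(a, a.checkpoint())
    print("new checkpoint written at LSN", a.checkpoint())

    # Case 2: a PREPARE record is damaged before its state file is written.
    b = ToyServer()
    b.prepare("tx1")
    corrupt(b, b.pending_2pc["tx1"])
    try:
        b.checkpoint()
    except IOError as exc:
        print("checkpoint failed:", exc)
```

In the first case the running server moves past the damage on its own.
In the second, every checkpoint attempt hits the same bad record, which
is the dependency on the WAL "in the absence of a crash" described
above.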

> How did this corruption actually come about? Did it actually really just
> affect that single WAL segment? Somehow that doesn't seem too likely.

I don't know and might not be able to tell you even if I did.

> pg_resetwal also won't actually remove the pg_twophase/* files if they did end
> up getting created. But that's probably not a too common scenario.

Sure, but also, you can remove them yourself. IME, WAL corruption is
one of the worst-case scenarios in terms of being able to get the
database back into reasonable shape. I can advise a customer to remove
an entire file if I need to; I have also written code to create fake
files to replace real ones that were lost; I have also written code to
fix broken heap pages. But when the problem is WAL, how are you
supposed to repair it? It's very difficult, I think, bordering on
impossible. Does anyone ever try to reconstruct a valid WAL stream to
allow replay to continue? AFAICT the only realistic solution is to run
pg_resetwal and hope that's good enough. That's often acceptable, but
it's not very nice in a case like this. Because you can't checkpoint,
you have no way to force the system to flush all dirty pages before
shutting it down, which means you may lose a bunch of data if you shut
down to run pg_resetwal. But if you don't shut down then you have no
way out of the bad state unless you can repair the WAL.

I don't think this is going to be a frequent case, so maybe it's not
worth doing anything about. But it does seem objectively worse than
most failure scenarios, at least to me.

--
Robert Haas
EDB: http://www.enterprisedb.com
