Re: deferred writing of two-phase state files adds fragility

From: Andres Freund <andres(at)anarazel(dot)de>
To: Robert Haas <robertmhaas(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: deferred writing of two-phase state files adds fragility
Date: 2024-12-04 23:36:17
Message-ID: bd33awtmersrtcfpira7pkoglvn72k4kxg3xifg6mv5ozchyye@x5jzks5kufww
Lists: pgsql-hackers

Hi,

On 2024-12-04 12:04:47 -0500, Robert Haas wrote:
> Let's suppose that you execute PREPARE TRANSACTION and, before the
> next CHECKPOINT, the WAL record for the PREPARE TRANSACTION gets
> corrupted on disk. This might seem like an unlikely scenario, and it
> is, but we saw a case at EDB not too long ago.
>
> To a first approximation, the world ends. You can't execute COMMIT
> TRANSACTION or ROLLBACK TRANSACTION, so there's no way to resolve the
> prepared transaction.

Is 2PC really that special in that regard? If the WAL that contains the
checkpoint record itself gets corrupted, you're also in a world of hurt, once
you shut down? Or, to a slightly lower degree, if there's any corrupted
record between the redo pointer and the checkpoint record. And that's
obviously a lot more records than just 2PC COMMIT/RECORD, making the
likelihood of some corruption higher.
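
As a rough illustration of that last point, here is a back-of-the-envelope
sketch. It assumes every WAL record is corrupted independently with the same
small probability, which real storage faults need not follow, and both the
per-record probability and the record count are made-up numbers, not
measurements:

```python
def p_any_corrupt(n_records: int, p_record: float) -> float:
    """Probability that at least one of n_records is corrupted,
    assuming independent, identically likely corruption per record."""
    return 1.0 - (1.0 - p_record) ** n_records


# Hypothetical per-record corruption probability (an assumption).
p = 1e-9

# A single 2PC record versus every record between the redo pointer
# and the checkpoint record (100_000 is a hypothetical count).
one_2pc_record = p_any_corrupt(1, p)
redo_window = p_any_corrupt(100_000, p)

print(f"single 2PC record:        {one_2pc_record:.3e}")
print(f"redo pointer..checkpoint: {redo_window:.3e}")
```

Under those assumptions the exposure grows roughly linearly with the number
of records in the window, which is why the window as a whole is the bigger
risk.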

The only reason it seems somewhat special is that it can more easily be
noticed while the server is running.

How did this corruption actually come about? Did it really affect just that
single WAL segment? Somehow that doesn't seem too likely.

> You also can't checkpoint, because that requires
> writing a twophase state file for the prepared transaction, and that's
> not possible because the WAL can't be read. What you have is a mostly
> working system, except that it's going to bloat over time because the
> prepared transaction is going to hold back the VACUUM horizon. And you
> basically have no way out of that problem, because there's no tool
> that says "I understand that my database is going to be corrupted,
> that's ok, just forget about that twophase transaction".

> If you shut down the database, then things become truly awful. You
> can't get a clean shutdown because you can't checkpoint, so you're
> going to resume recovery from the last checkpoint before the problem
> happened, find the corrupted WAL, and fail. As long as your database
> was up, you at least had the possibility of getting all of your data
> out of it by running pg_dump, as long as you can survive the amount of
> time that's going to take. And, if you did do that, you wouldn't even
> have corruption. But once your database has gone down, you can't get
> it back up again without running pg_resetwal. Running pg_resetwal is
> not very appealing here -- first because now you do have corruption
> whereas before the shutdown you didn't, and second because the last
> checkpoint could already be a long time in the past, depending on how
> quickly you realized you have this problem.

pg_resetwal also won't actually remove the pg_twophase/* files if they did end
up getting created. But that's probably not too common a scenario.

Greetings,

Andres Freund
