Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Cc: Noah Misch <noah(at)leadboat(dot)com>, Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>
Subject: Issues with 2PC at recovery: CLOG lookups and GlobalTransactionData
Date: 2025-01-30 06:36:20
Message-ID: Z5sd5O9JO7NYNK-C@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi all,
(Noah and Vitaly in CC, who got involved in the previous discussion)

This thread is a continuation of the discussion that happened here:
https://www.postgresql.org/message-id/13b5b6-676c3080-4d-531db900%4047931709

And I am beginning a new thread about going through an issue that Noah
has mentioned at [1], which is that the 2PC code may attempt to do
CLOG lookups at very early stage of recovery, where the cluster is not
in a consistent state.

I have dug into this issue, and while working on it, noticed a
separate bug related to how ProcessTwoPhaseBuffer() decides to handle
2PC entries in its shmem area GlobalTransactionData. Most of the
callers of this routine follow this pattern:
for (i = 0; i < TwoPhaseState->numPrepXacts; i++)
{
GlobalTransaction gxact = TwoPhaseState->prepXacts[i];
[...]
buf = ProcessTwoPhaseBuffer(gxact->xid, ...);
[...]
}

That may look OK, but we may finish by calling PrepareRedoRemove(),
which itself does a RemoveGXact(), manipulating directly TwoPhaseState
while we're looping with it. At first, I thought that it can be
pretty bad because it is possible to leave TwoPhaseState entries
around at the end of recovery, assuming that these are still live
transactions in the cluster, keeping locks, etc. However, and as far
as I can see, this would be only reachable if there was 2PC data on
disk inconsistent with the cluster. Base backups via custom tools
would be a problem, making the problem less worse.

My conclusion is that while these two issues are independent, their
fix is the same. My reasoning is that we'd better extract the
internals of ProcessTwoPhaseBuffer() that do the CLOG lookup and
future-2PC checks and move them to the places where they are relevant,
in a fashion similar to the original 2PC code of recovery in the 2005
area with d0a89683a3a4, where CLOG lookups were only attempted at the
end of recovery, with future lookups done at its beginning.

In our case, CLOG lookups are safe to do in
RecoverPreparedTransactions(), where 2PC transaction data is
reinstated before the server is ready for writes. The cleanup of the
2PC entries should happen once all the entries have been scanned. The
checks for future files can be done in restoreTwoPhaseData(), code
path taken early at recovery when scanning the 2PC files on disk.

Another thing that's struck me a bit is that it is possible to make
the checks done at the beginning of recovery tighter, by checking if
the 2PC file involved is older than the cluster-wide oldest xmin,
something we know from the checkpoint record.

Attached is a set of patches for HEAD, down to v13. Most of them have
finished by having conflicts. The patches in ~16 are straight-forward
as 2PC files don't have epochs. The patches in 17~ are more
invasive because FullTransactionIds are not integrated fully in
the 2PC stack (aka 5a1dfde8334b). This relies on 81772a495ec9 that
has made the work simpler. Still, you can notice that the basics are
the same as the ~16 patches, where the logic from
ProcessTwoPhaseBuffer is extracted.

I have gone back and forth with the versions of 17~ for a couple of
days and decided to bite the bullet with basics taken from the area of
[2], integrating FullTransactionIds deeper so as it is deterministic
to remove 2PC files from disk. That's more invasive, perhaps that's
for the best at the end keeping 17~ more consistent, meaning less
conflicts.

Each patch has a set of regression tests that check all these
conditions. They are a bit artistic, still the first test has caught
the second problem while I was looking at ways to automate the first
problem.

This needs more eyes, so I'll go add an entry in the CF. Patches for
stable branches are labelled with .txt, to avoid any interference with
the CF bot.

Thoughts and comments are welcome.

[1]: https://www.postgresql.org/message-id/20250117005221.05.nmisch@google.com
[2]: https://www.postgresql.org/message-id/20250116205254.65.nmisch@google.com
--
Michael

Attachment Content-Type Size
0001-Fix-set-of-issues-with-2PC-transaction-handli-master.patch text/x-diff 51.0 KB
0001-Fix-set-of-issues-with-2PC-transaction-handling-dur-13.txt text/plain 12.5 KB
0001-Fix-set-of-issues-with-2PC-transaction-handling-dur-14.txt text/plain 12.3 KB
0001-Fix-set-of-issues-with-2PC-transaction-handling-dur-15.txt text/plain 12.3 KB
0001-Fix-set-of-issues-with-2PC-transaction-handling-dur-16.txt text/plain 12.3 KB
0001-Fix-set-of-issues-with-2PC-transaction-handling-dur-17.txt text/plain 51.3 KB

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Amit Kapila 2025-01-30 06:38:08 Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.
Previous Message Nisha Moond 2025-01-30 06:30:02 Re: Improve error handling for invalid slots and ensure a same 'inactive_since' time for inactive slots