Re: An improvement of ProcessTwoPhaseBuffer logic

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: An improvement of ProcessTwoPhaseBuffer logic
Date: 2025-01-29 23:57:23
Message-ID: Z5rAY_Gl4yVeX67L@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jan 29, 2025 at 03:00:54PM +0300, Vitaly Davydov wrote:
> It seems, there are much deeper problems with twophase transactions
> as I thought. I'm interested in fixing twophase transactions,
> because I support a solution which actively uses twophase
> transactions. I'm interested to get more deeply into the twophase
> functionality. Below, I just want to clarify some ideas behind
> twophase transactions. I appreciate, if you comment my point of
> view, or just ignore this email if you find it too naive and boring.

(Please be careful about your email indentation. These are hard to
read.)

> Two phase files are created after checkpoint to keep twophase states
> on disk after WAL truncation. For transactions, that are inside the
> checkpoint horizon, we do not create two phase files because the
> states are currently stored in the WAL.

Yeah, the oldest XID horizon stored in the checkpoint records would be
stuck if you have a long-running 2PC transaction. It's actually
something I have been brewing a bit for the last few days: we should
discard early at recovery files that are older than the horizon in the
checkpoint record we've retrieved at the beginning of recovery. So
the early check can be tighter with past files.

> Based on the thesis above, I guess, we have to read only those
> twophase files which are related to the transactions before the
> latest checkpoint. Its full xid should be lesser than
> TransamVariables->nextXid (which is the same as
> ControlFile->checkPointCopy.nextXid at the moment of StartupXLOG ->
> restoreTwoPhaseData call). The files with greater (or equal) full
> xids, should be ignored and removed. That's all what we need in
> restoreTwoPhaseData, I believe.

Yeah. It is important to do that only in restoreTwoPhaseData(). The
check about already-aborted and already-committed 2PC files is
something we have to keep as well. We should only do it at the end of
recovery.

> In the current implementation, such check is applied in
> restoreTwoPhaseData -> ProcessTwoPhaseBuffer but after checking for
> xid in CLOG. I'm not sure, why we check CLOG here. Once we truncate
> the WAL on checkpoint and save twophase states into pg_twophase,
> these files must store states of real transactions from past.

Based on my analysis, the CLOG check ought to happen only in
RecoverPreparedTransactions(), which is the last step taken at
recovery. The future and past checks only need to happen in
restoreTwoPhaseData(). I'm been bumping my head on my desk on this
area for a few days now, but I think that the solution is simple:
these checks should be moved *outside* ProcessTwoPhaseBuffer() and
applied one layer higher in the two places I've mentioning above,
where they matter.

> There is another question about the loading order of twophase files
> but I think it doesn't matter in which order we load these
> files. But I would prefer to load it in full xid ascending order.

I don't see a need in doing that in the short-term, but perhaps you
have a point with the ordering of the entries in the TwoPhaseState
shmem data if there are many entries to handle. Another idea would be
to switch to a hash table.

Note that I've actually found what looks like a potential data
corruption bug in the logic while doing this investigation and adding
TAP tests to automate all that: if we detect a 2PC file as already
aborted or committed in ProcessTwoPhaseBuffer() while scanning
TwoPhaseState->numPrepXacts, we could finish by calling
PrepareRedoRemove(), which itself *manipulates* TwoPhaseState,
decrementing numPrepXacts. This can leave dangling entries in
TwoPhaseState if you have multiple TwoPhaseState entries (my
implemented TAP tests have been doing exactly that). This is relevant
at the end of recovery where CLOG lookups are OK to do.

So the situation is a bit of a mess in all the branches, a bit worse
in 17~ due to the fact that epochs are poorly integrated, but I'm
getting there with a set of patches mostly ready to be sent, and I
think that this would be OK for a backpatch. My plan is to start a
new thread to deal with the matter, because that's a bit larger than
the topic you have raised on this thread. I'll add you and Noah in CC
for awareness.
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Peter Smith 2025-01-30 00:51:13 Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided.
Previous Message Peter Smith 2025-01-29 23:52:38 Re: Improve error handling for invalid slots and ensure a same 'inactive_since' time for inactive slots