An improvement of ProcessTwoPhaseBuffer logic

From: "Vitaly Davydov" <v(dot)davydov(at)postgrespro(dot)ru>
To: "pgsql-hackers" <pgsql-hackers(at)postgresql(dot)org>
Subject: An improvement of ProcessTwoPhaseBuffer logic
Date: 2024-12-24 13:26:32
Message-ID: 11e597-676ab680-8d-374f23c0@145466129
Lists: pgsql-hackers

Dear Hackers,

I would like to discuss the ProcessTwoPhaseBuffer function. It reads two-phase transaction states from disk or from WAL. It takes an xid along with some other input parameters and executes the following steps:

Step #1: Check whether the xid is committed or aborted in clog (TransactionIdDidCommit, TransactionIdDidAbort)
Step #2: Check that the xid is not equal to or greater than ShmemVariableCache->nextXid
Step #3: Read the two-phase state for the specified xid from memory or from the corresponding file and return it

In some very rare scenarios, the postgres instance will never recover because of this logic. Imagine that the two_phase directory contains files with two-phase states of transactions from the distant future. I assume this can happen if some WAL segments (as well as clog data) are broken and ignored, but the two_phase directory is intact. During recovery, postgresql reads all the files in two_phase and tries to recover the two-phase states.

The problem appears in the functions TransactionIdDidCommit and TransactionIdDidAbort. These functions may fail with a FATAL message like the one below when no clog state is available on disk for the xid:

FATAL: could not access status of transaction 286331153
DETAIL: Could not open file "pg_xact/0111": No such file or directory.
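
For reference, the file name in the DETAIL line follows from the xid by plain arithmetic. The sketch below assumes a default build (8 kB blocks, 2 status bits per transaction, 32 pages per SLRU segment); the helper names clog_segment_for_xid and clog_path_for_xid are hypothetical and are not PostgreSQL functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Constants assumed from a default PostgreSQL build (8 kB blocks). */
#define BLCKSZ                  8192
#define CLOG_XACTS_PER_BYTE     4   /* 2 status bits per transaction */
#define CLOG_XACTS_PER_PAGE     (BLCKSZ * CLOG_XACTS_PER_BYTE)
#define SLRU_PAGES_PER_SEGMENT  32

/* Hypothetical helper: number of the pg_xact segment that holds this xid. */
static uint32_t
clog_segment_for_xid(uint32_t xid)
{
    return xid / CLOG_XACTS_PER_PAGE / SLRU_PAGES_PER_SEGMENT;
}

/* Hypothetical helper: format the segment path as four hex digits. */
static void
clog_path_for_xid(uint32_t xid, char *buf, size_t buflen)
{
    assert(buflen >= sizeof("pg_xact/0000"));
    snprintf(buf, buflen, "pg_xact/%04X",
             (unsigned) clog_segment_for_xid(xid));
}
```

With these assumptions each segment covers 1048576 xids, so xid 286331153 (0x11111111) lands in segment 0x111, which matches the "pg_xact/0111" in the message above.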

Such an error prevents the postgresql instance from starting.

My guess is that if we swap Step #1 with Step #2, this error will disappear, because such transactions will be filtered out by comparing the xid with ShmemVariableCache->nextXid before clog is accessed. The function will be more robust. In general it works, but I'm not sure that this logic will not break some rare boundary cases. Another solution is to catch and ignore the error, but the first solution is the simpler one. I would appreciate any thoughts on this topic. Maybe you know of some cases where such a change in logic is not appropriate?
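
To make the two orderings concrete, here is a toy model of the checks. It is only a sketch and not the PostgreSQL source: ToyCluster, toy_clog_lookup, check_current_order and check_swapped_order are hypothetical names, clog is reduced to a single "covered up to" bound, and a failed lookup stands in for the FATAL error.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef enum
{
    VERDICT_KEEP,           /* two-phase state would be read (Step #3) */
    VERDICT_DROP_FINISHED,  /* already committed or aborted */
    VERDICT_DROP_FUTURE,    /* xid >= nextXid */
    VERDICT_FATAL           /* clog lookup failed, startup stops */
} Verdict;

/* Toy stand-in for cluster state; all fields are assumptions. */
typedef struct
{
    TransactionId next_xid;        /* models ShmemVariableCache->nextXid */
    TransactionId clog_end;        /* xids >= this have no clog file */
    TransactionId finished_below;  /* xids < this committed or aborted */
} ToyCluster;

/* Wraparound-aware "a >= b", in the style of xid comparisons. */
static bool
xid_follows_or_equals(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) >= 0;
}

/* Returns false where the real lookup would raise FATAL. */
static bool
toy_clog_lookup(const ToyCluster *c, TransactionId xid, bool *finished)
{
    assert(finished != NULL);
    if (xid >= c->clog_end)
        return false;
    *finished = xid < c->finished_below;
    return true;
}

/* Order described in the steps: clog first, then nextXid. */
static Verdict
check_current_order(const ToyCluster *c, TransactionId xid)
{
    bool finished;

    if (!toy_clog_lookup(c, xid, &finished))
        return VERDICT_FATAL;
    if (finished)
        return VERDICT_DROP_FINISHED;
    if (xid_follows_or_equals(xid, c->next_xid))
        return VERDICT_DROP_FUTURE;
    return VERDICT_KEEP;
}

/* Proposed order: nextXid first, so clog is never asked about the future. */
static Verdict
check_swapped_order(const ToyCluster *c, TransactionId xid)
{
    bool finished;

    if (xid_follows_or_equals(xid, c->next_xid))
        return VERDICT_DROP_FUTURE;
    if (!toy_clog_lookup(c, xid, &finished))
        return VERDICT_FATAL;
    if (finished)
        return VERDICT_DROP_FINISHED;
    return VERDICT_KEEP;
}
```

In this model, an xid from the distant future gives VERDICT_FATAL with the current order and VERDICT_DROP_FUTURE with the swapped order, while xids below next_xid get the same verdict either way. Whether the real function has boundary cases that this model does not capture is exactly the open question.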

Thank you in advance!

With best regards,
Vitaly
