Re: An improvement of ProcessTwoPhaseBuffer logic

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: An improvement of ProcessTwoPhaseBuffer logic
Date: 2024-12-25 05:04:40
Message-ID: Z2uSaKlwYzHD1cbk@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Dec 24, 2024 at 04:26:32PM +0300, Vitaly Davydov wrote:
> In some, very rare scenarios, the postgres instance will newer
> recover because of such logic. Imagine, that the two_phase directory
> contains some files with two-phase states of transactions of distant
> future. I assume, it can happen if some WAL segments are broken and
> ignored (as well as clog data) but two_phase directory was not
> broken. In recovery, postgresql reads all the files in two_phase and
> tries to recover two-phase states.

Well, one recent issue in this area is a bug fixed by cf4401fe6cf5:
https://www.postgresql.org/message-id/tencent_A7F059B5136A359625C7B2E4A386B3C3F007@qq.com

If you see other bug patterns like this one, please let me know.
There are good changes that I could be involved in stuff that touched
this area of the code, so I'm mostly behind its maintenance these
days.

> My guess, if to swap Step #1 with Step #2 such error will disappear
> because transactions will be filtered when comparing xid with
> ShmemVariableCache->nextXid before accessing clog. The function will
> be more robust. In general, it works but I'm not sure that such
> logic will not break some rare boundary cases. Another solution is
> to catch and ignore such error, but the original solution is the
> simpler one. I appreciate any thoughts concerning this topic. May
> be, you know some cases when such change in logic is not relevant?

Hmm. Historically, up to 9.6, StandbyRecoverPreparedTransactions()
and RecoverPreparedTransactions() have been in charge of doing the
commit/abort checks with potentially clog lookups when it came to past
files. PrescanPreparedTransactions() has been in charge of doing the
check about future files.

If you look closely, PrescanPreparedTransactions() is called in two
code paths before StandbyRecoverPreparedTransactions(), meaning that
we did the checks for the future files first, then the past checks
with potential CLOG lookups.

In v10~, things have changed to use ProcessTwoPhaseBuffer(), and as
you say, the checks are what we have now: they're done in a reverse
order. So, yes, I agree that you have a point here: the past code was
safer because it would have been able to avoid clog lookups if we had
a 2PC file in the future for a reason or another so it was safer than
what we have now.

I'm wondering if the attached patch is something worth backpatching,
actually, as these failing CLOG lookups can lead to spurious failures
at startup. HEAD-only would be OK based on the odds so that's a
no-brainer. Vitaly, have you seen that in the wild as an effect of
future 2PC files?

Any opinions from others?
--
Michael

Attachment Content-Type Size
twophase-process-checks.patch text/x-diff 1.5 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Japin Li 2024-12-25 05:15:53 Re: Parametrization minimum password lenght
Previous Message Andrei Lepikhov 2024-12-25 04:56:24 Short-living Memory Context in the optimiser