Re: Slow catchup of 2PC (twophase) transactions on replica in LR

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: Vitaly Davydov <v(dot)davydov(at)postgrespro(dot)ru>, Ajin Cherian <itsajin(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Slow catchup of 2PC (twophase) transactions on replica in LR
Date: 2024-07-08 11:55:01
Message-ID: CAA4eK1+FRrL_fLWLsWQGHZRESg39ixzDX_S9hU8D7aFtU+a8uQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Jul 8, 2024 at 12:34 PM Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> > Another possible problem is related to my use case. I haven't reproduced this
> > case, just some thoughts. I guess, when two_phase is ON, the PREPARE statement
> > may be truncated from the WAL at checkpoint, but COMMIT PREPARED is still kept
> > in the WAL. On catchup, I would ask the master to send transactions from some
> > restart LSN. I would like to get all such transactions competely, with theirs
> > bodies, not only COMMIT PREPARED messages.
>
> I don't think it is a real issue. WALs for prepared transactions will retain
> until they are committed/aborted.
> When the two_phase is on and transactions are PREPAREd, they will not be
> cleaned up from the memory (See ReorderBufferProcessTXN()). Then, RUNNING_XACT
> record leads to update the restart_lsn of the slot but it cannot be move forward
> because ReorderBufferGetOldestTXN() returns the prepared transaction (See
> SnapBuildProcessRunningXacts()). restart_decoding_lsn of each transaction, which
> is a candidate of restart_lsn of the slot. is always behind the startpoint of
> its txn.
>

I see that in 0003/0004, the patch first aborts pending prepared
transactions, update's catalog, and then change slot's property via
walrcv_alter_slot. What if there is any ERROR (say the remote node is
not reachable or there is an error while updating the catalog) after
we abort the pending prepared transaction? Won't we end up with lost
prepared transactions in such a case?

Few other comments:
=================
The code to handle SUBOPT_TWOPHASE_COMMIT should be after failover
option handling for the sake of code symmetry. Also, the checks should
be in same order like first for slot_name, then enabled, then for
PreventInTransactionBlock(), after those, we can have other checks for
two_phase. If possible, we can move common checks in both failover and
two_phase options into a common function.

What should be the behavior if one tries to set slot_name to NONE and
also tries to toggle two_pahse option? I feel both options together
don't makes sense because there is no use in changing two_phase for
some slot which we are disassociating the subscription from. The same
could be said for the failover option as well, so if we agree with
some different behavior here, we can follow the same for failover
option as well.

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Lakhin 2024-07-08 12:00:00 Re: ssl tests fail due to TCP port conflict
Previous Message Amit Langote 2024-07-08 11:48:32 Re: Doc Rework: Section 9.16.13 SQL/JSON Query Functions