Re: Found issues related with logical replication and 2PC

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>
Cc: "pgsql-hackers(at)lists(dot)postgresql(dot)org" <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Found issues related with logical replication and 2PC
Date: 2024-08-08 04:52:56
Message-ID: CAA4eK1+SRn0zOnJqBixJZSNF4V0cp2xw47LSY8yRGcx2QqLJkQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Jul 24, 2024 at 12:25 PM Hayato Kuroda (Fujitsu)
<kuroda(dot)hayato(at)fujitsu(dot)com> wrote:
>
> While creating a patch which allows ALTER SUBSCRIPTION SET (two_phase) [1],
> we found some issues related with logical replication and two_phase. I think this
> can happen not only HEAD but PG14+, but for now I shared patches for HEAD.
>
> Issue #1
>
> When handling a PREPARE message, the subscriber mistook the wrong lsn position
> (the end position of the last commit) as the end position of the current prepare.
> This can be fixed by adding a new global variable to record the end position of
> the last prepare. 0001 patch fixes the issue.
>
> Issue #2
>
> When the subscriber enables two-phase commit but doesn't set max_prepared_transaction >0
> and a transaction is prepared on the publisher, the apply worker reports an ERROR
> on the subscriber. After that, the prepared transaction is not replayed, which
> means it's lost forever. Attached script can emulate the situation.
>
> --
> ERROR: prepared transactions are disabled
> HINT: Set "max_prepared_transactions" to a nonzero value.
> --
>
> The reason is that we advanced the origin progress when aborting the
> transaction as well (RecordTransactionAbort->replorigin_session_advance). So,
> after setting replorigin_session_origin_lsn, if any ERROR happens when preparing
> the transaction, the transaction aborts which incorrectly advances the origin lsn.
>
> An easiest fix is to reset session replication origin before calling the
> RecordTransactionAbort(). I think this can happen when 1) LogicalRepApplyLoop()
> raises an ERROR or 2) apply worker exits. 0002 patch fixes the issue.
>

Can we start a separate thread to issue 2? I understand that this one
is also related to two_phase but since both are different issues it is
better to discuss in separate threads. This will also help us refer to
the discussion in future if required.

BTW, why did the 0002 patch change the below code:
--- a/src/include/replication/worker_internal.h
+++ b/src/include/replication/worker_internal.h
@@ -164,7 +164,8 @@ typedef struct ParallelApplyWorkerShared

/*
* XactLastCommitEnd or XactLastPrepareEnd from the parallel apply worker.
- * This is required by the leader worker so it can update the lsn_mappings.
+ * This is required by the leader worker so it can update the
+ * lsn_mappings.
*/
XLogRecPtr last_commit_end;

--
With Regards,
Amit Kapila.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Hayato Kuroda (Fujitsu) 2024-08-08 05:07:18 [bug fix] prepared transaction might be lost when max_prepared_transactions is zero on the subscriber
Previous Message Pavel Stehule 2024-08-08 04:44:31 Re: pgsql: Introduce hash_search_with_hash_value() function