From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
Cc: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Fix slot synchronization with two_phase decoding enabled |
Date: | 2025-03-27 06:29:24 |
Message-ID: | CAA4eK1KqTKKGFA1_m-JbOfrGoGHy9=ZM5TX+B92jEzZf7NatuA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, Mar 25, 2025 at 12:14 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Tue, Mar 25, 2025 at 11:05 AM Zhijie Hou (Fujitsu)
> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> >
> > Hi,
> >
> > When testing the slot synchronization with logical replication slots that
> > enabled two_phase decoding, I found that transactions prepared before two-phase
> > decoding is enabled may fail to replicate to the subscriber after being
> > committed on a promoted standby following a failover.
> >
> > To reproduce this issue, please follow these steps (also detailed in the
> > attached TAP test, v1-0001):
> >
> > 1. sub: create a subscription with (two_phase = false)
> > 2. primary (pub): prepare a txn A.
> > 3. sub: alter subscription set (two_phase = true) and wait for the logical slot to
> > be synced to standby.
> > 4. primary (pub): stop primary, promote the standby and let the subscriber use
> > the promoted standby as publisher.
> > 5. promoted standby (pub): COMMIT PREPARED A;
> > 6. sub: the apply worker will report the following ERROR because it didn't
> > receive the PREPARE.
> > ERROR: prepared transaction with identifier "pg_gid_16387_752" does not exist
> >
> > I think the root cause of this issue is that the two_phase_at field of the
> > slot, which indicates the LSN from which two-phase decoding is enabled (used to
> > prevent duplicate data transmission for prepared transactions), is not
> > synchronized to the standby server.
> >
> > In step 3, transaction A is not immediately replicated because it occurred
> > before enabling two-phase decoding. Thus, the prepared transaction should only
> > be replicated after decoding the final COMMIT PREPARED, as referenced in
> > ReorderBufferFinishPrepared(). However, due to the invalid two_phase_at on the
> > standby, the prepared transaction fails to send at that time.
> >
> > This problem arises after the support for altering the two-phase option
> > (1462aad).
> >
I suspect that this can happen in PG17 as well, but I need to think
more about it to make a reproducible test case.
In the meantime, I have a few minor comments on the proposed patches:
1.
##################################################
# Promote the standby1 to primary. Confirm that:
# a) the slot 'lsub1_slot' and 'snap_test_slot' are retained on the new primary
# b) logical replication for regress_mysub1 is resumed successfully
after failover
# c) changes can be consumed from the synced slot 'snap_test_slot'
##################################################
-$standby1->start;
$primary->wait_for_replay_catchup($standby1);
# Capture the time before the standby is promoted
@@ -885,6 +940,15 @@ $standby1->wait_for_catchup('regress_mysub1');
is($subscriber1->safe_psql('postgres', q{SELECT count(*) FROM tab_int;}),
"20", 'data replicated from the new primary');
+# Commit the prepared transaction
+$standby1->safe_psql('postgres',
+ "COMMIT PREPARED 'test_twophase_slotsync';");
+$standby1->wait_for_catchup('regress_mysub1');
+
+# Confirm that the prepared transaction is replicated to the subscriber
+is($subscriber1->safe_psql('postgres', q{SELECT count(*) FROM tab_int;}),
+ "21", 'prepared data replicated from the new primary');
The commentary above this test should include information about
verifying the replication of previously prepared transactions after
promotion. Also, it would be better if confirm the commit prepared
before confirming the new Insert is replicated after promotion.
2.
@@ -249,6 +250,7 @@ update_local_synced_slot(RemoteSlot *remote_slot,
Oid remote_dbid,
SpinLockAcquire(&slot->mutex);
slot->data.restart_lsn = remote_slot->restart_lsn;
slot->data.confirmed_flush = remote_slot->confirmed_lsn;
+ slot->data.two_phase_at = remote_slot->two_phase_at;
Why do we need to update the two_phase_at here when the patch does it
later in this function when local and remote values don't match?
--
With Regards,
Amit Kapila.
From | Date | Subject | |
---|---|---|---|
Next Message | Amit Kapila | 2025-03-27 06:45:51 | Re: Enhance 'pg_createsubscriber' to retrieve databases automatically when no database is provided. |
Previous Message | wenhui qiu | 2025-03-27 06:29:17 | Re: Use CLOCK_MONOTONIC_COARSE for instr_time when available |