From: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
---|---|
To: | PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Fix slot synchronization with two_phase decoding enabled |
Date: | 2025-03-25 05:35:32 |
Message-ID: | TYAPR01MB5724CC7C288535BBCEEE65DA94A72@TYAPR01MB5724.jpnprd01.prod.outlook.com |
Lists: | pgsql-hackers |
Hi,
While testing slot synchronization with logical replication slots that have
two_phase decoding enabled, I found that transactions prepared before two-phase
decoding was enabled may fail to replicate to the subscriber when they are
committed on a promoted standby after a failover.
To reproduce this issue, please follow these steps (also detailed in the
attached TAP test, v1-0001):
1. sub: create a subscription with (two_phase = false)
2. primary (pub): prepare a txn A.
3. sub: alter subscription set (two_phase = true) and wait for the logical slot to
be synced to the standby.
4. primary (pub): stop primary, promote the standby and let the subscriber use
the promoted standby as publisher.
5. promoted standby (pub): COMMIT PREPARED A;
6. sub: the apply worker will report the following ERROR because it didn't
receive the PREPARE.
ERROR: prepared transaction with identifier "pg_gid_16387_752" does not exist
I think the root cause of this issue is that the two_phase_at field of the
slot, which indicates the LSN from which two-phase decoding is enabled (used to
prevent duplicate data transmission for prepared transactions), is not
synchronized to the standby server.
Transaction A is not replicated at step 3, because it was prepared before
two-phase decoding was enabled. It should therefore be replicated only when the
final COMMIT PREPARED is decoded; see ReorderBufferFinishPrepared(). However,
because two_phase_at is invalid on the standby, the prepared transaction is not
sent at that point.
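The effect can be illustrated with a small Python model. This is only a sketch under stated assumptions: the function name, the message labels and the LSN values are made up for illustration, and the single comparison is a simplification of the check in ReorderBufferFinishPrepared(), not the actual PostgreSQL code.

```python
# Toy model, not PostgreSQL code. LSNs are plain integers here and
# 0 stands in for an invalid (unset) LSN.
INVALID_LSN = 0


def messages_at_commit_prepared(prepare_lsn, two_phase_at):
    """Return what a two_phase-enabled decoder sends for COMMIT PREPARED.

    prepare_lsn  -- hypothetical LSN at which the transaction was prepared
    two_phase_at -- LSN from which two-phase decoding is enabled for the slot
    """
    messages = []
    # The PREPARE was skipped earlier only if it happened before
    # two_phase_at, so that is the only case where it is replayed now.
    if prepare_lsn < two_phase_at:
        messages.append("PREPARE")
    messages.append("COMMIT PREPARED")
    return messages


# Slot with a valid two_phase_at: the skipped PREPARE is sent first.
with_valid_lsn = messages_at_commit_prepared(prepare_lsn=100, two_phase_at=200)

# Synced slot whose two_phase_at was never copied: the comparison is
# false, so only COMMIT PREPARED goes out and the subscriber cannot
# find the prepared transaction.
with_invalid_lsn = messages_at_commit_prepared(prepare_lsn=100,
                                               two_phase_at=INVALID_LSN)
```

In this model the second call corresponds to the promoted standby in step 5, which is the case where the apply worker reports the missing prepared transaction.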
This problem was introduced when support for altering the two-phase option was
added (1462aad). Previously, two-phase could only be enabled during slot
creation, which waits for all prepared transactions to finish (via ... ->
SnapBuildWaitSnapshot) before reaching a consistent state, so the bug didn't exist.
To address the issue, I propose synchronizing the two_phase_at field to the
standby server, as implemented in the attached patches. As mentioned
earlier, this bug exists only in PG18, so we do not need to back-patch.
v1-0001: Tap test to reproduce the issue
I place this patch as the first one so that reviewers can run it
independently to reproduce the issue. Once the problem is thoroughly
understood and the fix proves stable, the patches can be integrated.
v1-0002: Display two_phase_at in the pg_replication_slots view
v1-0003: Sync the two_phase_at field of a replication slot to the standby
An alternative approach might be to modify ALTER_REPLICATION_SLOT to wait
for all prepared transactions to commit when enabling two-phase. However, this
appears inelegant and less user-friendly.
Best Regards,
Hou zj
Attachment | Content-Type | Size |
---|---|---|
v1-0001-Tap-test-to-reproduce-the-issue.patch | application/octet-stream | 5.5 KB |
v1-0003-Sync-the-two_phase_at-field-of-a-replication-slot.patch | application/octet-stream | 4.4 KB |
v1-0002-Display-two_phase_at-in-the-pg_replication_slots-.patch | application/octet-stream | 5.4 KB |