RE: Fix slot synchronization with two_phase decoding enabled

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>
Subject: RE: Fix slot synchronization with two_phase decoding enabled
Date: 2025-04-17 12:44:16
Message-ID: OS0PR01MB5716B44052000EB91EFAE60E94BC2@OS0PR01MB5716.jpnprd01.prod.outlook.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

The recently added tests for slotsync on two-phase enabled slots failed[1] due
to a broken assertion triggered while decoding the COMMIT PREPARED record on
the promoted standby.

-----
Analysis
-----

if ((txn->final_lsn < two_phase_at) && is_commit)
{
/*
* txn must have been marked as a prepared transaction and skipped but
* not sent a prepare. Also, the prepare info must have been updated
* in txn even if we skip prepare.
*/
Assert((txn->txn_flags & RBTXN_PREPARE_STATUS_MASK) ==
(RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE));

I think the issue stem from the PREPARED transaction, which was skipped on the
primary, not being skipped on the promoted standby. The root cause appears to
be the latest 'confirmed_flush' (0/6005118) of the source slot not being synced
to the standby. This results in the confirmed_flush (0/6004F90) of the synced
slot being less than the synced two_phase_at field (0/6005118). Consequently,
the PREPARE record's LSN exceeds the synced confirmed_flush during
decoding, causing it to be incorrectly decoded and sent to the subscriber. Thus, when
decoding COMMIT PREPARED, the PREPARE record is found before two_phase_at but
wasn't skipped.

-- The LOGs that proves the confirmed_flush is not synced:

decoding on original slot (primary):
LOG: 0/6004F90 has been already streamed, forwarding to 0/6005118
...
DETAIL: Streaming transactions committing after 0/6005118, reading WAL from 0/60049C8.

decoding on synced slot (promted standby) - failed run:
DETAIL: Streaming transactions committing after 0/6004F90, reading WAL from 0/60049C8.

decoding on synced slot (promted standby) - success run:
LOG: 0/6004F90 has been already streamed, forwarding to 0/6005118
...
DETAIL: Streaming transactions committing after 0/6005118, reading WAL from 0/60049C8.
--

The logs above clearly show that during the test failure, the starting position
(6004F90) was less than that of a successful run. Additionally, it was lower
than the starting position of the remote slot on the primary, indicating a
failure to synchronize confirmed_lsn.

The reason confirmed_flush isn't synced should stem from the following check,
which skip the syncing of the confirmed_flush value. I also reproduced the
assertion failure by creating a scenario that leads to sync omission due to
this check:

/*
* Don't overwrite if we already have a newer catalog_xmin and
* restart_lsn.
*/
if (remote_slot->restart_lsn < slot->data.restart_lsn ||
TransactionIdPrecedes(remote_slot->catalog_xmin,
slot->data.catalog_xmin))
{

This check triggered because the synced catalog_xmin surpassed that of the
source slot (It can be seen from the log that the restart_lsn was synced to the
same value) ). This situation arises due to several factors:

On the publisher(primary), the slot may retain an older catalog_xmin and
restart_lsn because it is being consumed by a walsender. In this scenario, if
the apply worker does not immediately confirm that changes have been flushed,
the walsender advances the catalog_xmin/restart_lsn slowly. It sets an old
value for candidate_catalog_xmin/candidate_restart_lsn and would not update
them until the apply worker confirms via LogicalConfirmReceivedLocation.

However, the slotsync worker has the opportunity to begin WAL reading from a
more recent point, potentially advancing catalog_xmin/restart_lsn to newer
values.

And it's also possible to advance only catalog_xmin while keep restart_lsn
unchanged, by starting a transaction before the running_xact record. In this
case, it would pass builder->last_serialized_snapshot as restart_lsn to
LogicalIncreaseRestartDecodingForSlot(), but last_serialized_snapshot would
point to the position of the last running_xact record instead of the current
one, or the value would be NULL if no snapshot has been serialized before, so
it would not advance restart_lsn. The advancement of catalog_xmin is not
blocked by running transaction, as it would either use the XID of the running
transaction or the oldestRunningXid as candidate.

-----
Reproduction
-----

Accompanying this analysis is a script to reproduce the issue. In these
scripts, two running_xacts were logged post-slot creation on the publisher.
Below is a detailed progression of restart_lsn/catalog_xmin advancement
observed on both the publisher and standby:

The process is : slot creation -> log running_xact -> prepare a txn -> log
running_xact

The process of advancing the catalog_xmin/restart_lsn on publisher:

slot creation
...
1. 0/301B880 - running_xacts (no running txn, oldestrunningxid = 755)

got new catalog xmin 755 at 0/301B880
LOG: got new restart lsn 0/301B880 at 0/301B880
2. 0/301B8B8 - PREPAPRE a TRANSACTION.
3. 0/301BA20 - running_xacts (no running txn, oldestrunningxid = 756)

failed to increase restart lsn: proposed 0/301B880, after 0/301BA20,
current candidate 0/301B880, current after 0/301B880, flushed up to
0/301B810

final slot data:

catalog_xmin | 755
restart_lsn | 0/301B880
confirmed_flush_lsn | 0/301BAC8

The process of advancing the catalog_xmin/restart_lsn on standby (during slot
sync):

slot creation
...
1. 0/301B880 - running_xacts (no running txn, oldestrunningxid = 755)

building consistent snapshot, so will skip advancing the catalog_xmin/restart_lsn
...
2. 0/301BA20 - running_xacts (no running txn, oldestrunningxid = 756)

got new catalog xmin 756 at 0/301BA20

cannot get new restart_lsn as running txn exists and didn't serialize
snapshot yet.

first synced slot data:

catalog_xmin | 756
restart_lsn | 0/301B880
confirmed_flush_lsn | 0/301BAC8

The second slot sync cycle would skip updating confirmed_lsn because
catalog_xmin is newer.

-----
Fix
-----

I think we should keep the confirmed_flush even if the previous synced
restart_lsn/catalog_xmin is newer. Attachments include a patch for the same.

Best Regards,
Hou zj

Attachment Content-Type Size
test_two_phase.sh.txt text/plain 5.1 KB
0001-Fix-assertion-failure-when-decoding-synced-two-phase.patch application/octet-stream 2.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Zhijie Hou (Fujitsu) 2025-04-17 12:46:05 RE: Fix slot synchronization with two_phase decoding enabled
Previous Message Rahila Syed 2025-04-17 12:25:04 Re: Align memory context level numbering in pg_log_backend_memory_contexts()