From: | Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> |
---|---|
To: | "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com> |
Cc: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com> |
Subject: | Re: Fix slot synchronization with two_phase decoding enabled |
Date: | 2025-04-18 04:28:57 |
Message-ID: | CAA4eK1+q_qmp45weGeW4yBmHV4k-kiFYNFNH4oeg_N8vGjC+Ew@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Thu, Apr 17, 2025 at 6:14 PM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> Hi,
>
> The recently added tests for slotsync on two-phase enabled slots failed[1] due
> to a broken assertion triggered while decoding the COMMIT PREPARED record on
> the promoted standby.
>
> -----
> Analysis
> -----
>
> if ((txn->final_lsn < two_phase_at) && is_commit)
> {
> /*
> * txn must have been marked as a prepared transaction and skipped but
> * not sent a prepare. Also, the prepare info must have been updated
> * in txn even if we skip prepare.
> */
> Assert((txn->txn_flags & RBTXN_PREPARE_STATUS_MASK) ==
> (RBTXN_IS_PREPARED | RBTXN_SKIPPED_PREPARE));
>
> I think the issue stem from the PREPARED transaction, which was skipped on the
> primary, not being skipped on the promoted standby. The root cause appears to
> be the latest 'confirmed_flush' (0/6005118) of the source slot not being synced
> to the standby. This results in the confirmed_flush (0/6004F90) of the synced
> slot being less than the synced two_phase_at field (0/6005118). Consequently,
> the PREPARE record's LSN exceeds the synced confirmed_flush during
> decoding, causing it to be incorrectly decoded and sent to the subscriber. Thus, when
> decoding COMMIT PREPARED, the PREPARE record is found before two_phase_at but
> wasn't skipped.
>
> -- The LOGs that proves the confirmed_flush is not synced:
>
> decoding on original slot (primary):
> LOG: 0/6004F90 has been already streamed, forwarding to 0/6005118
> ...
> DETAIL: Streaming transactions committing after 0/6005118, reading WAL from 0/60049C8.
>
> decoding on synced slot (promted standby) - failed run:
> DETAIL: Streaming transactions committing after 0/6004F90, reading WAL from 0/60049C8.
>
> decoding on synced slot (promted standby) - success run:
> LOG: 0/6004F90 has been already streamed, forwarding to 0/6005118
> ...
> DETAIL: Streaming transactions committing after 0/6005118, reading WAL from 0/60049C8.
> --
>
> The logs above clearly show that during the test failure, the starting position
> (6004F90) was less than that of a successful run. Additionally, it was lower
> than the starting position of the remote slot on the primary, indicating a
> failure to synchronize confirmed_lsn.
>
> The reason confirmed_flush isn't synced should stem from the following check,
> which skip the syncing of the confirmed_flush value. I also reproduced the
> assertion failure by creating a scenario that leads to sync omission due to
> this check:
>
> /*
> * Don't overwrite if we already have a newer catalog_xmin and
> * restart_lsn.
> */
> if (remote_slot->restart_lsn < slot->data.restart_lsn ||
> TransactionIdPrecedes(remote_slot->catalog_xmin,
> slot->data.catalog_xmin))
> {
>
>
> This check triggered because the synced catalog_xmin surpassed that of the
> source slot (It can be seen from the log that the restart_lsn was synced to the
> same value) ). This situation arises due to several factors:
>
> On the publisher(primary), the slot may retain an older catalog_xmin and
> restart_lsn because it is being consumed by a walsender. In this scenario, if
> the apply worker does not immediately confirm that changes have been flushed,
> the walsender advances the catalog_xmin/restart_lsn slowly. It sets an old
> value for candidate_catalog_xmin/candidate_restart_lsn and would not update
> them until the apply worker confirms via LogicalConfirmReceivedLocation.
>
> However, the slotsync worker has the opportunity to begin WAL reading from a
> more recent point, potentially advancing catalog_xmin/restart_lsn to newer
> values.
>
> And it's also possible to advance only catalog_xmin while keep restart_lsn
> unchanged, by starting a transaction before the running_xact record. In this
> case, it would pass builder->last_serialized_snapshot as restart_lsn to
> LogicalIncreaseRestartDecodingForSlot(), but last_serialized_snapshot would
> point to the position of the last running_xact record instead of the current
> one, or the value would be NULL if no snapshot has been serialized before, so
> it would not advance restart_lsn. The advancement of catalog_xmin is not
> blocked by running transaction, as it would either use the XID of the running
> transaction or the oldestRunningXid as candidate.
>
Your analysis appears correct to me.
>
> -----
> Fix
> -----
>
> I think we should keep the confirmed_flush even if the previous synced
> restart_lsn/catalog_xmin is newer. Attachments include a patch for the same.
>
This will fix the case we are facing but adds a new rule for slot
synchronization. Can we think of a simpler way to fix this by avoiding
updating other slot fields (like two_phase, two_phase_at) if
restart_lsn or catalog_xmin of the local slot is ahead of the remote
slot?
--
With Regards,
Amit Kapila.
From | Date | Subject | |
---|---|---|---|
Next Message | Konstantin Knizhnik | 2025-04-18 07:06:23 | Re: Changing shared_buffers without restart |
Previous Message | David Rowley | 2025-04-18 04:13:02 | Re: n_ins_since_vacuum stats for aborted transactions |