Re: Conflict detection for update_deleted in logical replication

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Conflict detection for update_deleted in logical replication
Date: 2024-09-10 06:44:55
Message-ID: CAJpy0uBDzE6BBd8AzFkJzaORxdQRfcLogy7qrSKvtRS751L_kQ@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 5, 2024 at 5:07 PM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
>
> Hi hackers,
>
> I am starting a new thread to discuss and propose the conflict detection for
> update_deleted scenarios during logical replication. This conflict occurs when
> the apply worker cannot find the target tuple to be updated, as the tuple might
> have been removed by another origin.
>
> ---
> BACKGROUND
> ---
>
> Currently, when the apply worker cannot find the target tuple during an update,
> an update_missing conflict is logged. However, to facilitate future automatic
> conflict resolution, it has been agreed[1][2] that we need to detect both
> update_missing and update_deleted conflicts. Specifically, we will detect an
> update_deleted conflict if any dead tuple matching the old key value of the
> update operation is found; otherwise, it will be classified as update_missing.
>
> Detecting both update_deleted and update_missing conflicts is important for
> achieving eventual consistency in a bidirectional cluster, because the
> resolution for each conflict type can differ. For example, for an
> update_missing conflict, a feasible solution might be converting the update to
> an insert and applying it, whereas for an update_deleted conflict, the
> preferred approach could be to skip the update, or to compare the timestamp of
> the delete transaction with that of the remote update transaction and apply
> the more recent change. For additional context, please refer to [3], which
> gives examples of how these differences could lead to data divergence.
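>
> As a simple illustration of that difference (the timestamps here are just
> examples): suppose Node A deletes the row at 10:02 and Node B's update of the
> same row committed at 10:01. If Node A can detect update_deleted, a
> timestamp-based (last-update-wins) resolution sees that the delete is more
> recent and skips the update, so both nodes converge on the row being absent.
> If only update_missing can be detected, the default apply_or_skip resolution
> converts the update into an insert on Node A, and the two nodes diverge.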
>
> ---
> ISSUES and SOLUTION
> ---
>
> To detect update_deleted conflicts, we need to search for dead tuples in the
> table. However, dead tuples can be removed by VACUUM at any time. Therefore, to
> ensure consistent and accurate conflict detection, tuples deleted by other
> origins must not be removed by VACUUM before the conflict detection process. If
> the tuples are removed prematurely, it might lead to incorrect conflict
> identification and resolution, causing data divergence between nodes.
>
> Here is an example of how VACUUM could affect conflict detection and how to
> prevent this issue. Assume we have a bidirectional cluster with two nodes, A
> and B.
>
> Node A:
> T1: INSERT INTO t (id, value) VALUES (1,1);
> T2: DELETE FROM t WHERE id = 1;
>
> Node B:
> T3: UPDATE t SET value = 2 WHERE id = 1;
>
> To retain the deleted tuples, the initial idea was that once transaction T2 had
> been applied to both nodes, there was no longer a need to preserve the dead
> tuple on Node A. However, a scenario arises where transactions T3 and T2 occur
> concurrently, with T3 committing slightly earlier than T2. In this case, if
> Node B applies T2 and Node A removes the dead tuple (1,1) via VACUUM, and then
> Node A applies T3 after the VACUUM operation, it can only result in an
> update_missing conflict. Given that the default resolution for update_missing
> conflicts is apply_or_skip (e.g. convert update to insert if possible and apply
> the insert), Node A will eventually hold a row (1,2) while Node B becomes
> empty, causing data inconsistency.
>
> Therefore, the strategy needs to be expanded as follows: Node A cannot remove
> the dead tuple until:
> (a) The DELETE operation is replayed on all remote nodes, *AND*
> (b) The transactions on logical standbys occurring before the replay of Node
> A's DELETE are replayed on Node A as well.
>
> ---
> THE DESIGN
> ---
>
> To achieve the above, we plan to allow the logical walsender to maintain and
> advance the slot.xmin to protect the data in the user table and introduce a new
> logical standby feedback message. This message reports a WAL position such
> that (1) the position has been replayed on the logical standby *AND* (2) the
> changes occurring on the logical standby before that position have also been
> replayed to the walsender's node (where the walsender is running). After
> receiving the new feedback message, the walsender will advance slot.xmin based
> on the flush info, similar to the advancement of catalog_xmin. Currently, the
> effective_xmin/xmin of a logical slot are unused during logical replication, so
> I think it's safe and won't cause side effects to reuse the xmin for this
> feature.
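>
> As a rough illustration (assuming the walsender advances slot.xmin as
> described; pg_replication_slots and its columns already exist, only the slot
> name 'sub_b_slot' is made up), the retained horizon could be inspected on
> Node A with:
>
>     -- On Node A: a non-NULL xmin means VACUUM cannot yet remove tuples
>     -- deleted by that transaction or any later one
>     SELECT slot_name, xmin, catalog_xmin, confirmed_flush_lsn
>     FROM pg_replication_slots
>     WHERE slot_name = 'sub_b_slot';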
>
> We have introduced a new subscription option (feedback_slots='slot1,...'),
> where these slots will be used to check condition (b): the transactions on
> logical standbys occurring before the replay of Node A's DELETE are replayed on
> Node A as well. Therefore, on Node B, users should specify the slots
> corresponding to Node A in this option. The apply worker will get the oldest
> confirmed flush LSN among the specified slots and send the LSN as a feedback
> message to the walsender. -- I also thought of making this automatic, e.g.
> letting the apply worker select the slots acquired by the walsenders that
> connect to the same remote server (e.g. if the apply worker's connection info
> or some other flag matches the walsender's connection info). But it seems
> tricky because if some slots are inactive, meaning the walsenders are not
> there, the apply worker cannot find the correct slots to check unless we save
> the host along with the slot's persistent data.
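>
> A minimal sketch of the intended usage (feedback_slots is the option proposed
> here, and the slot, subscription, publication and host names are made up for
> illustration):
>
>     -- On Node B: the subscription applying Node A's changes names the local
>     -- slot(s) through which Node A receives Node B's changes
>     CREATE SUBSCRIPTION sub_from_a
>         CONNECTION 'host=node_a dbname=postgres'
>         PUBLICATION pub_a
>         WITH (feedback_slots = 'slot_used_by_node_a');
>
>     -- Conceptually, the apply worker on Node B then picks the oldest
>     -- confirmed flush LSN among those slots, e.g.
>     SELECT min(confirmed_flush_lsn)
>     FROM pg_replication_slots
>     WHERE slot_name IN ('slot_used_by_node_a');
>     -- and sends it to Node A's walsender in the new feedback message.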
>
> The new feedback message is sent only if feedback_slots is not NULL. If the
> slots in feedback_slots are removed, a final message containing
> InvalidXLogRecPtr will be sent to inform the walsender to forget about the
> slot.xmin.
>
> To detect update_deleted conflicts during update operations, if the target row
> cannot be found, we perform an additional scan of the table using SnapshotAny.
> This scan aims to locate the most recently deleted row that matches the old
> column values from the remote update operation and has not yet been removed by
> VACUUM. If any such tuples are found, we report the update_deleted conflict
> along with the origin and transaction information that deleted the tuple.
>
> Please refer to the attached POC patch set, which implements the above design.
> The patch set is split into several parts to make the initial review easier.
> Please note that the patches are interdependent and cannot work independently.
>
> Thanks a lot to Kuroda-San and Amit for the off-list discussion.
>
> Suggestions and comments are highly appreciated!
>

Thank you, Hou-San, for explaining the design. To make it easier to
understand, would you be able to explain the sequence/timeline of the
*new* actions performed by the walsender and the apply processes for
the example below, along with the new feedback_slots configuration
needed?

Node A: (Procs: walsenderA, applyA)
T1: INSERT INTO t (id, value) VALUES (1,1); ts=10.00 AM
T2: DELETE FROM t WHERE id = 1; ts=10.02 AM

Node B: (Procs: walsenderB, applyB)
T3: UPDATE t SET value = 2 WHERE id = 1; ts=10.01 AM

thanks
Shveta
