Re: Conflict detection for update_deleted in logical replication

From: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Conflict detection for update_deleted in logical replication
Date: 2024-09-10 11:24:45
Message-ID: CAA4eK1+jyC5toBww2PSqKPcV38PT2y2THixYMLxpNLZxy_chzA@mail.gmail.com
Lists: pgsql-hackers

On Thu, Sep 5, 2024 at 5:07 PM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> ---
> ISSUES and SOLUTION
> ---
>
> To detect update_deleted conflicts, we need to search for dead tuples in the
> table. However, dead tuples can be removed by VACUUM at any time. Therefore, to
> ensure consistent and accurate conflict detection, tuples deleted by other
> origins must not be removed by VACUUM before the conflict detection process. If
> the tuples are removed prematurely, it might lead to incorrect conflict
> identification and resolution, causing data divergence between nodes.
>
> Here is an example of how VACUUM could affect conflict detection and how to
> prevent this issue. Assume we have a bidirectional cluster with two nodes, A
> and B.
>
> Node A:
> T1: INSERT INTO t (id, value) VALUES (1,1);
> T2: DELETE FROM t WHERE id = 1;
>
> Node B:
> T3: UPDATE t SET value = 2 WHERE id = 1;
>
> To retain the deleted tuples, the initial idea was that once transaction T2 had
> been applied to both nodes, there would no longer be a need to preserve the dead
> tuple on Node A. However, a scenario arises where transactions T3 and T2 occur
> concurrently, with T3 committing slightly earlier than T2. In this case, if
> Node B applies T2 and Node A removes the dead tuple (1,1) via VACUUM, and
> Node A then applies T3 after the VACUUM operation, the apply can only be
> detected as an update_missing conflict. Given that the default resolution for
> update_missing conflicts is apply_or_skip (i.e., convert the UPDATE to an
> INSERT if possible and apply the insert), Node A will eventually hold a row
> (1,2) while Node B becomes empty, causing data inconsistency.
>
> Therefore, the strategy needs to be expanded as follows: Node A cannot remove
> the dead tuple until:
> (a) The DELETE operation is replayed on all remote nodes, *AND*
> (b) The transactions on logical standbys occurring before the replay of Node
> A's DELETE are replayed on Node A as well.
>
> ---
> THE DESIGN
> ---
>
> To achieve the above, we plan to allow the logical walsender to maintain and
> advance the slot.xmin to protect the data in the user table, and to introduce a
> new logical standby feedback message. This message reports a WAL position such
> that (1) the position has been replayed on the logical standby, *AND* (2) the
> changes occurring on the logical standby before that position have also been
> replayed on the walsender's node (the node where the walsender is running).
> After receiving the new feedback message, the walsender will advance the
> slot.xmin based on the flush info, similar to the advancement of catalog_xmin.
> Currently, the effective_xmin/xmin of a logical slot are unused during logical
> replication, so I think it is safe and won't cause side effects to reuse the
> xmin for this feature.
>
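
To make the xmin advancement concrete, a minimal sketch of the walsender side
could look like the following. This is an illustration only, not code from the
patch: the function name is invented, and it assumes the walsender has already
derived a transaction ID that is safe to use given the flush position reported
by the new feedback message.

#include "postgres.h"
#include "access/transam.h"
#include "replication/slot.h"
#include "storage/spin.h"

/* Sketch only: advance the slot's xmin forward, never backward. */
static void
AdvanceSlotXminFromFeedback(TransactionId feedback_xmin)
{
	ReplicationSlot *slot = MyReplicationSlot;
	bool		changed = false;

	if (!TransactionIdIsValid(feedback_xmin))
		return;

	SpinLockAcquire(&slot->mutex);
	if (!TransactionIdIsValid(slot->data.xmin) ||
		TransactionIdPrecedes(slot->data.xmin, feedback_xmin))
	{
		slot->data.xmin = feedback_xmin;
		slot->effective_xmin = feedback_xmin;
		changed = true;
	}
	SpinLockRelease(&slot->mutex);

	if (changed)
	{
		ReplicationSlotMarkDirty();
		/* Recompute the global horizon so VACUUM honours the new xmin. */
		ReplicationSlotsComputeRequiredXmin(false);
	}
}
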
> We have introduced a new subscription option (feedback_slots='slot1,...');
> these slots will be used to check condition (b): the transactions on logical
> standbys occurring before the replay of Node A's DELETE are replayed on Node A
> as well. Therefore, on Node B, users should specify the slots corresponding to
> Node A in this option. The apply worker will get the oldest confirmed flush
> LSN among the specified slots and send that LSN as a feedback message to the
> walsender. -- I also thought of making this automatic, e.g. letting the apply
> worker select the slots acquired by the walsenders that connect to the same
> remote server (e.g. if the apply worker's connection info or some other flag
> matches the walsender's connection info). But it seems tricky, because if some
> slots are inactive, which means the walsenders are not there, the apply worker
> could not find the correct slots to check unless we save the host along with
> the slot's persistent data.
>
> The new feedback message is sent only if feedback_slots is not NULL.
>
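
Spelling out the "oldest confirmed flush LSN among the specified slots" step,
the apply-worker side check amounts to something like the sketch below. This
is only an illustration; the names are mine, and the patch's actual code
(e.g. get_slot_confirmed_flush() in 0003) may differ in the details.

#include "postgres.h"
#include "access/xlogdefs.h"
#include "nodes/pg_list.h"
#include "nodes/value.h"
#include "replication/slot.h"
#include "storage/lwlock.h"
#include "storage/spin.h"

/*
 * Sketch: return the oldest confirmed_flush LSN among the named slots; this
 * is the value the apply worker would report in the new feedback message.
 */
static XLogRecPtr
get_oldest_feedback_lsn(List *feedback_slot_names)
{
	XLogRecPtr	result = InvalidXLogRecPtr;
	ListCell   *lc;

	/* Keep slots from being dropped or recreated while we read them. */
	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

	foreach(lc, feedback_slot_names)
	{
		char	   *slotname = strVal(lfirst(lc));
		ReplicationSlot *slot;
		XLogRecPtr	confirmed_flush;

		slot = SearchNamedReplicationSlot(slotname, false);
		if (slot == NULL)
			ereport(ERROR,
					errmsg("replication slot \"%s\" does not exist", slotname));

		SpinLockAcquire(&slot->mutex);
		confirmed_flush = slot->data.confirmed_flush;
		SpinLockRelease(&slot->mutex);

		if (XLogRecPtrIsInvalid(result) || confirmed_flush < result)
			result = confirmed_flush;
	}

	LWLockRelease(ReplicationSlotControlLock);

	return result;
}
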

Don't you need to deal with versioning stuff for sending this new
message? I mean, what if the receiver side of this message is old and
doesn't support this new message?
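
For example (just to sketch the kind of guard I mean; the version cutoff and
the helper below are invented, not existing code), the apply worker could
check the publisher's version and otherwise stick to the existing
send_feedback() path:

#include "postgres.h"
#include "replication/walreceiver.h"

/*
 * Sketch only: decide whether the publisher can accept the new feedback
 * message at all.  The 180000 cutoff is a placeholder for whichever release
 * ends up carrying this feature.
 */
static bool
publisher_supports_slot_feedback(WalReceiverConn *conn)
{
	return walrcv_server_version(conn) >= 180000;
}

with the existing feedback message used whenever this returns false.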

One minor comment on 0003
=======================
1.
get_slot_confirmed_flush()
{
...
+ /*
+ * To prevent concurrent slot dropping and creation while filtering the
+ * slots, take the ReplicationSlotControlLock outside of the loop.
+ */
+ LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);
+
+ foreach_ptr(String, name, MySubscription->feedback_slots)
+ {
+ XLogRecPtr confirmed_flush;
+ ReplicationSlot *slot;
+
+ slot = ValidateAndGetFeedbackSlot(strVal(name));

Why do we need to validate slots each time here? Isn't it better to do it once?
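
To sketch what I mean by doing it once (the flag and function below are
invented, only building on the 0003 code quoted above): run the full
validation a single time, e.g. at worker start or whenever feedback_slots
changes, so the per-feedback path only needs to read confirmed_flush.

/*
 * Sketch only: remember that validation already happened; reset the flag
 * whenever the subscription's feedback_slots list changes.
 */
static bool feedback_slots_validated = false;

static void
validate_feedback_slots_once(void)
{
	if (feedback_slots_validated)
		return;

	LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

	foreach_ptr(String, name, MySubscription->feedback_slots)
		(void) ValidateAndGetFeedbackSlot(strVal(name));

	LWLockRelease(ReplicationSlotControlLock);

	feedback_slots_validated = true;
}
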

--
With Regards,
Amit Kapila.
