Conflict detection for update_deleted in logical replication

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Conflict detection for update_deleted in logical replication
Date: 2024-09-05 11:36:38
Message-ID: OS0PR01MB5716BE80DAEB0EE2A6A5D1F5949D2@OS0PR01MB5716.jpnprd01.prod.outlook.com

Hi hackers,

I am starting a new thread to discuss and propose conflict detection for the
update_deleted scenario in logical replication. This conflict occurs when the
apply worker cannot find the target tuple to be updated because the tuple has
been removed by another origin.

---
BACKGROUND
---

Currently, when the apply worker cannot find the target tuple during an update,
an update_missing conflict is logged. However, to facilitate future automatic
conflict resolution, it has been agreed[1][2] that we need to detect both
update_missing and update_deleted conflicts. Specifically, we will detect an
update_deleted conflict if any dead tuple matching the old key value of the
update operation is found; otherwise, it will be classified as update_missing.

Detecting both update_deleted and update_missing conflicts is important for
achieving eventual consistency in a bidirectional cluster, because the
resolution for each conflict type can differ. For example, for an
update_missing conflict, a feasible resolution might be to convert the update
into an insert and apply it, while for an update_deleted conflict the
preferred approach could be to skip the update, or to compare the timestamp
of the delete transaction with that of the remote update transaction and keep
the more recent change. For additional context, please refer to [3], which
gives examples of how these differences could lead to data divergence.

---
ISSUES and SOLUTION
---

To detect update_deleted conflicts, we need to search for dead tuples in the
table. However, dead tuples can be removed by VACUUM at any time. Therefore, to
ensure consistent and accurate conflict detection, tuples deleted by other
origins must not be removed by VACUUM before the conflict detection process. If
the tuples are removed prematurely, it might lead to incorrect conflict
identification and resolution, causing data divergence between nodes.

Here is an example of how VACUUM could affect conflict detection and how to
prevent this issue. Assume we have a bidirectional cluster with two nodes, A
and B.

Node A:
T1: INSERT INTO t (id, value) VALUES (1,1);
T2: DELETE FROM t WHERE id = 1;

Node B:
T3: UPDATE t SET value = 2 WHERE id = 1;

To retain the deleted tuples, the initial idea was that once transaction T2
had been applied to both nodes, there would no longer be a need to preserve
the dead tuple on Node A. However, a scenario arises where transactions T3
and T2 occur concurrently, with T3 committing slightly earlier than T2. In
this case, if Node B applies T2 and Node A removes the dead tuple (1,1) via
VACUUM, then when Node A applies T3 after the VACUUM operation, only an
update_missing conflict can be detected. Given that the default resolution
for update_missing conflicts is apply_or_skip (e.g. convert the update to an
insert if possible and apply the insert), Node A will eventually hold a row
(1,2) while Node B becomes empty, causing data inconsistency.

Therefore, the strategy needs to be expanded as follows: Node A cannot remove
the dead tuple until:
(a) The DELETE operation is replayed on all remote nodes, *AND*
(b) The transactions on logical standbys occurring before the replay of Node
A's DELETE are replayed on Node A as well.

---
THE DESIGN
---

To achieve the above, we plan to allow the logical walsender to maintain and
advance the slot.xmin to protect the data in the user table, and to introduce
a new logical standby feedback message. This message reports a WAL position
such that the position has been replayed on the logical standby *AND* the
changes occurring on the logical standby before that position have also been
replayed back to the walsender's node (the node where the walsender is
running). After receiving the new feedback message, the walsender will
advance the slot.xmin based on the flush info, similar to the advancement of
catalog_xmin. Currently, the effective_xmin/xmin of a logical slot are unused
during logical replication, so I think it's safe to reuse the xmin for this
feature and it won't cause side effects.

We have introduced a new subscription option (feedback_slots='slot1,...');
the specified slots will be used to check condition (b): that the
transactions on the logical standbys occurring before the replay of Node A's
DELETE have been replayed on Node A as well. Therefore, on Node B, users
should specify the slots corresponding to Node A in this option. The apply
worker will get the oldest confirmed flush LSN among the specified slots and
send that LSN to the walsender in the feedback message. -- I also considered
making this automatic, e.g. letting the apply worker select the slots
acquired by walsenders that connect to the same remote server (for example,
when the apply worker's connection info or some other flag matches the
walsender's connection info). But that seems tricky, because if some slots
are inactive (meaning their walsenders are not running), the apply worker
cannot find the correct slots to check unless we save the host along with
the slot's persistent data.
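
To make the apply-worker side concrete, here is a simplified sketch (the
function name and the representation of feedback_slots as a List of slot name
strings are illustrative, not taken from the patch) of how the oldest
confirmed flush LSN among the configured slots could be computed:

#include "postgres.h"

#include "access/xlogdefs.h"
#include "nodes/pg_list.h"
#include "replication/slot.h"
#include "storage/lwlock.h"
#include "storage/spin.h"

/*
 * Sketch: return the oldest confirmed_flush LSN among the slots named in
 * feedback_slots, or InvalidXLogRecPtr if none of them exist (any more).
 */
static XLogRecPtr
get_oldest_feedback_slot_lsn(List *feedback_slot_names)
{
    XLogRecPtr  oldest = InvalidXLogRecPtr;
    ListCell   *lc;

    LWLockAcquire(ReplicationSlotControlLock, LW_SHARED);

    foreach(lc, feedback_slot_names)
    {
        char       *slotname = (char *) lfirst(lc);
        ReplicationSlot *slot = SearchNamedReplicationSlot(slotname, false);
        XLogRecPtr  flush_lsn;

        if (slot == NULL)
            continue;           /* slot has been dropped */

        SpinLockAcquire(&slot->mutex);
        flush_lsn = slot->data.confirmed_flush;
        SpinLockRelease(&slot->mutex);

        if (XLogRecPtrIsInvalid(oldest) || flush_lsn < oldest)
            oldest = flush_lsn;
    }

    LWLockRelease(ReplicationSlotControlLock);

    return oldest;
}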

The new feedback message is sent only if feedback_slots is not NULL. If the
slots in feedback_slots are removed, a final message containing
InvalidXLogRecPtr will be sent to inform the walsender to forget about the
slot.xmin.
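
Building on the sketch above, the sending side in the apply worker might look
roughly like the following (again only illustrative; send_feedback_message()
and feedback_slot_names stand in for whatever the patch actually uses, and
the real code would presumably send the InvalidXLogRecPtr message only once
as a final message rather than repeatedly):

/* Sketch: send the new feedback message only when feedback_slots is set. */
static void
maybe_send_slot_flush_feedback(List *feedback_slot_names)
{
    XLogRecPtr  flush_lsn;

    if (feedback_slot_names == NIL)
        return;                 /* option not set; keep the old behaviour */

    flush_lsn = get_oldest_feedback_slot_lsn(feedback_slot_names);

    /*
     * If all configured slots have been dropped, flush_lsn is
     * InvalidXLogRecPtr, which tells the walsender to forget slot.xmin.
     */
    send_feedback_message(flush_lsn);
}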

To detect update_deleted conflicts during update operations, if the target
row cannot be found, we perform an additional scan of the table using
SnapshotAny. This scan aims to locate the most recently deleted row that
matches the old column values from the remote update operation and has not
yet been removed by VACUUM. If such a tuple is found, we report the
update_deleted conflict along with the origin and transaction information of
the transaction that deleted the tuple.
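
The lookup could be structured roughly as below (a sketch only, not the patch
code). tuples_match_old_key() and tuple_deleting_xid() are hypothetical
helpers standing in for comparing a scanned tuple against the remote old key
values and for extracting the deleting xid of an already-dead tuple version,
and commit timestamps (track_commit_timestamp) are assumed to be available
for obtaining the deleting transaction's origin and time:

#include "postgres.h"

#include "access/commit_ts.h"
#include "access/sdir.h"
#include "access/tableam.h"
#include "access/transam.h"
#include "executor/tuptable.h"
#include "replication/origin.h"
#include "utils/rel.h"
#include "utils/snapmgr.h"
#include "utils/timestamp.h"

/* Hypothetical helpers, declared here only to keep the sketch compilable. */
extern bool tuples_match_old_key(Relation rel, TupleTableSlot *scanslot,
                                 TupleTableSlot *remoteslot);
extern TransactionId tuple_deleting_xid(TupleTableSlot *scanslot);

/*
 * Sketch: search for the most recently deleted, not-yet-vacuumed tuple that
 * matches the remote old key values, and return who deleted it and when.
 */
static bool
FindMostRecentlyDeletedTuple(Relation rel, TupleTableSlot *remoteslot,
                             RepOriginId *delete_origin,
                             TimestampTz *delete_time)
{
    TupleTableSlot *scanslot = table_slot_create(rel, NULL);
    TableScanDesc scan;
    bool        found = false;

    /* SnapshotAny returns every tuple version, including dead ones. */
    scan = table_beginscan(rel, SnapshotAny, 0, NULL);

    while (table_scan_getnextslot(scan, ForwardScanDirection, scanslot))
    {
        TransactionId xmax;
        TimestampTz ts;
        RepOriginId origin;

        if (!tuples_match_old_key(rel, scanslot, remoteslot))
            continue;

        xmax = tuple_deleting_xid(scanslot);
        if (!TransactionIdIsValid(xmax))
            continue;           /* live row, not a deleted version */

        /* Keep the match whose deleting transaction committed last. */
        if (TransactionIdGetCommitTsData(xmax, &ts, &origin) &&
            (!found || ts > *delete_time))
        {
            *delete_origin = origin;
            *delete_time = ts;
            found = true;
        }
    }

    table_endscan(scan);
    ExecDropSingleTupleTableSlot(scanslot);

    return found;               /* true => report update_deleted */
}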

Please refer to the attached POC patch set, which implements the above
design. The patch set is split into several parts to make the initial review
easier. Please note that the patches are interdependent and cannot work
independently.

Thanks a lot to Kuroda-San and Amit for the off-list discussion.

Suggestions and comments are highly appreciated!

[1]https://www.postgresql.org/message-id/CAJpy0uCov4JfZJeOvY0O21_gk9bcgNUDp4jf8%2BBbMp%2BEAv8cVQ%40mail.gmail.com
[2]https://www.postgresql.org/message-id/CAA4eK1Lj-PWrP789KnKxZydisHajd38rSihWXO8MVBLDwxG1Kg%40mail.gmail.com
[3]https://www.postgresql.org/message-id/CAJpy0uC6Zs5WwwiyuvG_kEB6Q3wyDWpya7PXm3SMT_YG%3DXJJ1w%40mail.gmail.com

Best Regards,
Hou Zhijie

Attachment Content-Type Size
v21-0001-Maintain-and-Advance-slot.xmin-in-logical-walsen.patch application/octet-stream 11.0 KB
v21-0002-Add-a-subscription-option-feedback_slots.patch application/octet-stream 60.9 KB
v21-0003-Send-the-slot-flush-feedback-message-via-apply-w.patch application/octet-stream 12.2 KB
v21-0004-Support-the-conflict-detection-for-update_delete.patch application/octet-stream 20.8 KB
v21-0005-Support-copying-xmin-value-of-slots-during-slots.patch application/octet-stream 7.1 KB
v21-0006-Add-a-tap-test-to-verify-the-new-slot-xmin-mecha.patch application/octet-stream 8.7 KB
