RE: Conflict detection for update_deleted in logical replication

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: Conflict detection for update_deleted in logical replication
Date: 2025-01-06 11:22:08
Message-ID: OS0PR01MB5716C8B3C364EEE86F91B3D294102@OS0PR01MB5716.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Friday, January 3, 2025 2:36 PM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:

Hi,

>
> I have one comment on the 0001 patch:

Thanks for the comments!

>
> + /*
> + * The changes made by this and later transactions are still non-removable
> + * to allow for the detection of update_deleted conflicts when applying
> + * changes in this logical replication worker.
> + *
> + * Note that this info cannot directly protect dead tuples from being
> + * prematurely frozen or removed. The logical replication launcher
> + * asynchronously collects this info to determine whether to advance the
> + * xmin value of the replication slot.
> + *
> + * Therefore, FullTransactionId that includes both the transaction ID and
> + * its epoch is used here instead of a single Transaction ID. This is
> + * critical because without considering the epoch, the transaction ID
> + * alone may appear as if it is in the future due to transaction ID
> + * wraparound.
> + */
> + FullTransactionId oldest_nonremovable_xid;
>
> The last paragraph of the comment mentions that we need to use
> FullTransactionId to properly compare XIDs even after the XID wraparound
> happens. But once we set the oldest-nonremovable-xid it prevents XIDs from
> being wraparound, no? I mean that workers'
> oldest-nonremovable-xid values and slot's non-removal-xid (i.e., its
> xmin) are never away from more than 2^31 XIDs.

I think the issue is that the launcher may create the replication slot after
the apply worker has already set 'oldest_nonremovable_xid', because the
launcher does that asynchronously. So, before the slot is created, there is
a window in which transaction IDs might wrap around. Suppose the apply worker
has initially computed a candidate_xid (755), and the XID counter wraps around
before the launcher creates the slot, so that the new current XID is 740. Then
the old candidate_xid (755) looks like an XID in the future, and the launcher
could advance the xmin to 755, which would cause dead tuples to be removed
prematurely. (We are trying to reproduce this to confirm that it's a real
issue, and will share the results once we have them.)
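
To make the comparison problem concrete, here is a minimal standalone sketch
(not code from the patch). The names xid32, full_xid and the helper functions
are hypothetical stand-ins; the 32-bit comparison imitates the modulo-2^32
logic used for normal XIDs and ignores the special XIDs. The epoch values 0
and 1 are assumed for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for a 32-bit XID and an epoch-qualified 64-bit XID. */
typedef uint32_t xid32;
typedef uint64_t full_xid;      /* epoch in the high 32 bits, XID in the low 32 bits */

/* Modulo-2^32 comparison: a precedes b if the signed difference is negative. */
static bool
xid32_precedes(xid32 a, xid32 b)
{
    return (int32_t) (a - b) < 0;
}

/* Combine an epoch and a 32-bit XID into one 64-bit value. */
static full_xid
make_full_xid(uint32_t epoch, xid32 xid)
{
    return ((full_xid) epoch << 32) | xid;
}

/* With the epoch included, a plain integer comparison is enough. */
static bool
full_xid_precedes(full_xid a, full_xid b)
{
    return a < b;
}

/* The scenario above: candidate 755 from epoch 0, current XID 740 in epoch 1. */
static void
wraparound_scenario(void)
{
    /* 32-bit view: 740 precedes 755, so the old candidate looks like the future. */
    assert(xid32_precedes(740, 755));

    /* 64-bit view: the candidate is correctly seen as being in the past. */
    assert(full_xid_precedes(make_full_xid(0, 755), make_full_xid(1, 740)));
}
```

With only the 32-bit values, nothing distinguishes the stale 755 from a
genuinely newer XID, which is the case the epoch is meant to cover.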

We thought of another approach, which is to create/drop this slot as soon as
one enables/disables detect_update_deleted (e.g. create/drop the slot during
DDL). But it seems complicated to control concurrent slot creation and
dropping. For example, if backend A enables detect_update_deleted, it will
create a slot. But if another backend B is disabling detect_update_deleted at
the same time, then the newly created slot may be dropped by backend B. I
thought about checking the number of subscriptions that enable
detect_update_deleted before dropping the slot in backend B, but the
subscription changes made by backend A may not be visible yet (e.g. not
committed yet).
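
The interleaving I have in mind can be sketched with a toy model. This is not
PostgreSQL code: toy_state and the backend_* functions are hypothetical, and
the model assumes backend B already has one committed subscription with the
option enabled and that B only sees committed subscription changes.

```c
#include <stdbool.h>

/* Hypothetical toy model of the shared slot and the subscription catalog. */
typedef struct
{
    bool slot_exists;       /* the conflict-detection slot */
    int  committed_subs;    /* subscriptions with the option on, visible to everyone */
    int  uncommitted_subs;  /* enabled in an open transaction, invisible to others */
} toy_state;

/* Backend A: enable the option in a transaction and create the slot right away. */
static void
backend_a_enable(toy_state *s)
{
    s->uncommitted_subs++;
    s->slot_exists = true;
}

/* Backend A: commit, making its subscription change visible. */
static void
backend_a_commit(toy_state *s)
{
    s->committed_subs += s->uncommitted_subs;
    s->uncommitted_subs = 0;
}

/* Backend B: disable the option; drop the slot if no visible subscription needs it. */
static void
backend_b_disable(toy_state *s)
{
    s->committed_subs--;
    if (s->committed_subs == 0)
        s->slot_exists = false;
}
```

Running backend_a_enable(), then backend_b_disable(), then backend_a_commit()
on a state that starts as {true, 1, 0} ends with one subscription that needs
the slot and no slot, because B's count could not see A's uncommitted change.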

Does that make sense to you, or do you have some other ideas?

Best Regards,
Hou zj
