RE: Conflict detection for update_deleted in logical replication

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "Hayato Kuroda (Fujitsu)" <kuroda(dot)hayato(at)fujitsu(dot)com>, Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: Conflict detection for update_deleted in logical replication
Date: 2024-11-12 10:19:51
Message-ID: OS0PR01MB571628594B26B4CC2346F09294592@OS0PR01MB5716.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Friday, October 18, 2024 5:45 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> On Tue, Oct 15, 2024 at 5:03 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> >
> > On Mon, Oct 14, 2024 at 9:09 AM Zhijie Hou (Fujitsu)
> > <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > >
> > > We thought of a few options to fix this:
> > >
> > > 1) Add a Time-Based Subscription Option:
> > >
> > > We could add a new time-based option for subscriptions, such as
> > > retain_dead_tuples = '5s'. After obtaining the candidate XID, the
> > > logical launcher will wait for the specified time before advancing
> > > slot.xmin. This ensures that deleted tuples are retained for at
> > > least the duration defined by this new option.
> > >
> > > This approach is designed to simulate the functionality of the GUC
> > > (vacuum_committs_age), but with a simpler implementation that does
> > > not impact vacuum performance. We can maintain both this time-based
> > > method and the current automatic method. If a user does not specify
> > > the time-based option, we will continue using the existing approach
> > > to retain dead tuples until all concurrent transactions from the
> > > remote node have been applied.
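
Just to illustrate option 1, here is a minimal sketch of the
launcher-side wait; the advance function and the option plumbing are
hypothetical, not part of the posted patches:

#include "postgres.h"
#include "miscadmin.h"
#include "pgstat.h"
#include "storage/latch.h"
#include "utils/timestamp.h"

/*
 * Hypothetical sketch: wait out the retain_dead_tuples interval before
 * advancing the conflict-detection slot's xmin.
 */
static void
maybe_advance_slot_xmin(TransactionId candidate_xmin, int retain_ms)
{
    TimestampTz candidate_time = GetCurrentTimestamp();

    /*
     * Sleep until at least retain_ms has elapsed since the candidate
     * XID was obtained; wake early if the latch is set.
     */
    while (!TimestampDifferenceExceeds(candidate_time,
                                       GetCurrentTimestamp(),
                                       retain_ms))
    {
        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         retain_ms,
                         WAIT_EVENT_LOGICAL_LAUNCHER_MAIN);
        ResetLatch(MyLatch);
    }

    /* Deleted tuples have now been retained for the configured time. */
    advance_conflict_slot_xmin(candidate_xmin); /* hypothetical */
}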
> > >
> > > 2) Modification to the Logical Walsender
> > >
> > > For a logical walsender running on a physical standby, we can
> > > build an additional connection to the physical primary to obtain
> > > the latest WAL position. This position will then be sent as
> > > feedback to the logical subscriber.
> > >
> > > A potential concern is that this requires the walsender to use the
> > > walreceiver API, which may seem a bit unnatural. It also starts an
> > > additional walsender process on the primary, as the logical
> > > walsender on the physical standby will need to communicate with
> > > this walsender to fetch the WAL position.
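
Purely as an illustration of the data flow in option 2, fetching the
position could look like the following, here over a plain libpq
connection rather than the walreceiver API (the helper name is
hypothetical):

#include <stdlib.h>
#include <string.h>
#include "libpq-fe.h"

/* Hypothetical helper: ask the primary for its current WAL position. */
static char *
fetch_primary_wal_lsn(const char *primary_conninfo)
{
    PGconn     *conn = PQconnectdb(primary_conninfo);
    PGresult   *res;
    char       *lsn = NULL;

    if (PQstatus(conn) != CONNECTION_OK)
    {
        PQfinish(conn);
        return NULL;
    }

    res = PQexec(conn, "SELECT pg_current_wal_lsn()");
    if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) == 1)
        lsn = strdup(PQgetvalue(res, 0, 0));

    PQclear(res);
    PQfinish(conn);

    /* The result would then be sent as feedback to the subscriber. */
    return lsn;
}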
> > >
> >
> > This idea is worth considering, but I think it may not be a good
> > approach if the physical standby is cascading. We need to restrict
> > the update_deleted conflict detection if the standby is cascading,
> > right?
> >
> > The other approach is to send current_timestamp from the subscriber
> > and somehow check whether the physical standby has applied
> > commit_lsn up to that commit_ts; if so, it can send that WAL
> > position to the subscriber, otherwise, wait for it to be applied. If
> > we do this then we don't need to add a restriction for a cascaded
> > physical standby. I think the subscriber anyway needs to wait for
> > such an LSN to be applied on the standby before advancing the xmin,
> > even if we get it from the primary. This is because the subscriber
> > can only apply and flush the WAL once it is applied on the standby.
> > Am I missing something?
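
A rough sketch of what that standby-side wait might look like (both
helper functions here are hypothetical placeholders, not existing
APIs):

/*
 * Hypothetical: delay answering the subscriber until replay has passed
 * the subscriber-provided timestamp, then report the replayed position.
 */
while (GetLatestAppliedCommitTs() < subscriber_ts)  /* hypothetical */
{
    (void) WaitLatch(MyLatch,
                     WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                     1000L,
                     WAIT_EVENT_WAL_SENDER_MAIN);
    ResetLatch(MyLatch);
}
send_wal_position_feedback(GetXLogReplayRecPtr(NULL)); /* hypothetical */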
> >
> > This approach has the disadvantage that we are relying on clocks
> > being synced on both nodes, which we anyway need for conflict
> > resolution as discussed in the thread [1]. We also need to consider
> > the Commit Timestamp and LSN inversion issue discussed in another
> > thread [2] if we want to pursue this approach, because we may miss
> > an LSN that has a prior timestamp.
> >

For the "publisher is also a standby" issue, I have modified the V8 patch to
report a warning in this case. As I personally feel this is not the main use case
for conflict detection, we can revisit it later after pushing the main patches
receiving some user feedback.
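
Roughly, the warning amounts to something like the following sketch
(not the exact code or message text from the patch):

/*
 * Sketch: warn when the publisher is itself a physical standby, where
 * update_deleted detection is not (yet) supported reliably.
 */
if (RecoveryInProgress())
    ereport(WARNING,
            (errmsg("conflict detection for update_deleted may be unreliable "
                    "when the publisher is a physical standby")));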

>
> The problem due to Commit Timestamp and LSN inversion is that the standby
> may not consider the WAL LSN from an earlier timestamp, which could lead to
> the removal of required dead rows on the subscriber.
>
> The other problem pointed out by Hou-San offlist due to Commit Timestamp
> and LSN inversion is that we could miss sending the WAL LSN that the
> subscriber requires to retain dead rows for update_deleted conflict
> detection. For example, consider the following two-node, bidirectional
> setup:
>
> Node A:
> T1: INSERT INTO t (id, value) VALUES (1,1); ts=10.00 AM
> T2: DELETE FROM t WHERE id = 1; ts=10.02 AM
>
> Node B:
> T3: UPDATE t SET value = 2 WHERE id = 1; ts=10.01 AM
>
> Say the subscription is created with retain_dead_tuples = true/false
>
> After executing T2, the apply worker on Node A will check the latest WAL
> flush location on Node B. By that time, T3 should have finished, so the
> xmin will be advanced only after applying the WAL that is later than T3.
> Hence, the dead tuple will not be removed before T3 is applied, which
> means the update_deleted conflict can be detected.
>
> As there is a gap between when we acquire the commit_timestamp and the
> commit LSN, it is possible that T3 has not yet flushed its WAL even
> though it committed earlier than T2. If this happens, then we won't be
> able to detect the update_deleted conflict reliably.
>
> Now, one simpler idea is to acquire the commit timestamp and reserve the
> WAL (LSN) under the same spinlock in ReserveXLogInsertLocation(), but
> that could be costly as discussed in the thread [1]. The other, more
> localized solution is to acquire the timestamp after reserving the
> commit WAL LSN, outside the lock, which will solve this particular
> problem.
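
In ordering terms, the localized idea is simply the fragment below; the
real reservation logic in ReserveXLogInsertLocation() is more involved,
and the struct and fields here are internal to xlog.c:

/*
 * Simplified ordering sketch: reserve the commit record's LSN first,
 * then take the commit timestamp outside the spinlock.  Any transaction
 * whose commit timestamp is visible has then already reserved its
 * commit LSN, so a flush-position check cannot miss it.
 */
SpinLockAcquire(&Insert->insertpos_lck);
startbytepos = Insert->CurrBytePos;        /* commit LSN reserved here */
Insert->CurrBytePos = startbytepos + size;
SpinLockRelease(&Insert->insertpos_lck);

commit_ts = GetCurrentTimestamp();         /* timestamp taken after LSN */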

Since the discussion of the WAL/LSN inversion issue is ongoing, I also thought
about another approach that can fix the issue independently. The idea is to
delay the non-removable xid advancement until all the remote concurrent
transactions that may have been assigned an earlier timestamp have committed.

The implementation is:

On the walsender, after receiving a request, it can send the oldest xid among
transactions that are currently committing (oldest_committing_xid) and the
next transaction ID (nextXid), along with the latest WAL position, in its
response.

Upon receiving the reply, the apply worker can safely advance the
non-removable XID if oldest_committing_xid == nextXid, indicating that there
are no race conditions.

Alternatively, if oldest_committing_xid != nextXid, the apply worker can send
a second request after some interval. If the subsequently obtained
oldest_committing_xid is greater than or equal to the initial nextXid, it
indicates that all previously risky transactions have committed, so the
non-removable transaction ID can be advanced.
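
To make the two checks concrete, here is a minimal sketch of the
apply-worker side; the function and variable names are hypothetical,
not taken from the patch:

#include "postgres.h"
#include "access/transam.h"

/* Hypothetical decision logic for advancing the non-removable XID. */
static bool
can_advance_nonremovable_xid(TransactionId oldest_committing_xid,
                             TransactionId next_xid,
                             TransactionId initial_next_xid,
                             bool is_first_reply)
{
    if (is_first_reply)
    {
        /* No transaction is mid-commit: safe to advance right away. */
        return TransactionIdEquals(oldest_committing_xid, next_xid);
    }

    /*
     * On a later reply, every transaction that was committing at the
     * time of the first request has finished once oldest_committing_xid
     * reaches the nextXid we saw initially.
     */
    return TransactionIdFollowsOrEquals(oldest_committing_xid,
                                        initial_next_xid);
}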

Attached is the V8 patch set. Note that I put the new approach for the above
race condition in a temp patch "v8-0001_2-Maintain-xxx.patch.txt", because the
approach may or may not be accepted based on the discussion in the WAL/LSN
inversion thread.

>
> [1] - https://www.postgresql.org/message-id/CAJpy0uBxEJnabEp3JS%3Dn9X19Vx2ZK3k5AR7N0h-cSMtOwYV3fA%40mail.gmail.com

Best Regards,
Hou zj

Attachment Content-Type Size
v8-0001_2-Maintain-the-oldest-non-removeable-tranasction-I.patch.txt text/plain 24.4 KB
v8-0001-Maintain-the-oldest-non-removeable-tranasction-ID.patch application/octet-stream 20.2 KB
v8-0002-Maintain-the-replication-slot-in-logical-launcher.patch application/octet-stream 11.4 KB
v8-0003-Support-the-conflict-detection-for-update_deleted.patch application/octet-stream 21.3 KB
v8-0004-Add-a-detect_update_deleted-option-to-subscriptio.patch application/octet-stream 65.2 KB
v8-0005-Add-a-tap-test-to-verify-the-new-slot-xmin-mechan.patch application/octet-stream 6.9 KB
