RE: Conflict detection for update_deleted in logical replication

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: Conflict detection for update_deleted in logical replication
Date: 2024-09-24 03:32:33
Message-ID: OS0PR01MB5716E0A283D1B66954CDF5A694682@OS0PR01MB5716.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
>
> Thank you for considering another idea.

Thanks for reviewing the idea!

>
> On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> >
> > On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> > <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > >
> > > Apart from the vacuum_defer_cleanup_age idea.
> > >
> >
> > I think you meant to say vacuum_committs_age idea.
> >
> > > we’ve given more thought to our approach for retaining dead tuples and
> > > have come up with another idea that can reliably detect conflicts
> > > without requiring users to choose a wise value for the
> > > vacuum_committs_age. This new idea could also reduce the performance
> > > impact. Thanks a lot to Amit for off-list discussion.
> > >
> > > The concept of the new idea is that the dead tuples are only useful to
> > > detect conflicts when applying *concurrent* transactions from remotes.
> > > Any subsequent UPDATE from a remote node after removing the dead tuples
> > > should have a later timestamp, meaning it's reasonable to detect an
> > > update_missing scenario and convert the UPDATE to an INSERT when
> > > applying it.
> > >
> > > To achieve the above, we can create an additional replication slot on
> > > the subscriber side, maintained by the apply worker. This slot is used
> > > to retain the dead tuples. The apply worker will advance the slot.xmin
> > > after confirming that all the concurrent transactions on the publisher
> > > have been applied locally.
>
> The replication slot used for this purpose will be a physical one or
> logical one? And IIUC such a slot doesn't need to retain WAL but if we
> do that, how do we advance the LSN of the slot?

I think it would be a logical slot. We can keep the
restart_lsn/confirmed_flush_lsn as invalid because we don't need to retain the
WALs for decoding purposes.
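
Just to illustrate the intent, here is a rough sketch (not patch code) of how
the apply worker could bump only the xmin of such a slot while leaving the LSN
fields invalid, loosely following how slot xmin is updated elsewhere in the
backend; the function name is made up:

/*
 * Illustrative sketch only: advance the xmin of the slot that retains
 * dead tuples.  restart_lsn/confirmed_flush stay InvalidXLogRecPtr, so
 * the slot does not hold back any WAL.
 */
static void
advance_conflict_slot_xmin(TransactionId new_xmin)
{
    Assert(MyReplicationSlot != NULL);

    SpinLockAcquire(&MyReplicationSlot->mutex);
    MyReplicationSlot->data.xmin = new_xmin;
    MyReplicationSlot->effective_xmin = new_xmin;
    SpinLockRelease(&MyReplicationSlot->mutex);

    ReplicationSlotMarkDirty();

    /* let the global xmin horizon pick up the new value */
    ReplicationSlotsComputeRequiredXmin(false);
}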

>
> > > 2) the apply worker sends a new message to the walsender to request the
> > > latest WAL flush position (GetFlushRecPtr) on the publisher, and saves
> > > it to 'candidate_remote_wal_lsn'. Here we could introduce a new
> > > feedback message or extend the existing keepalive message (e.g. extend
> > > the requestReply bit in the keepalive message to add a
> > > 'request_wal_position' value)
>
> The apply worker sends a keepalive message when it hasn't received
> anything for more than wal_receiver_timeout / 2. So in a very active
> system, we cannot rely on piggybacking new information on the
> keepalive messages to get the latest remote flush LSN.

Right. I think we need to send this new message at some interval independent of
wal_receiver_timeout.
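
A rough sketch of what the apply loop could do, assuming a fixed interval and
a new protocol message; the interval, the static variable and the helper
below are hypothetical, none of them exist today:

/*
 * Illustrative sketch only: ask the publisher for its latest flush
 * position at a fixed interval, independent of wal_receiver_timeout.
 */
#define WAL_POSITION_REQUEST_INTERVAL_MS 10000   /* assumed value */

static TimestampTz last_request_time = 0;

static void
maybe_request_publisher_flush_lsn(void)
{
    TimestampTz now = GetCurrentTimestamp();

    if (TimestampDifferenceExceeds(last_request_time, now,
                                   WAL_POSITION_REQUEST_INTERVAL_MS))
    {
        /* hypothetical new message asking the walsender for GetFlushRecPtr() */
        send_wal_position_request();
        last_request_time = now;
    }
}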

>
> > > 3) The apply worker can continue to apply changes. After applying all
> > > the WALs up to 'candidate_remote_wal_lsn', the apply worker can then
> > > advance the slot.xmin to 'candidate_xmin'.
> > >
> > > This approach ensures that dead tuples are not removed until all
> > > concurrent transactions have been applied. It can be effective for both
> > > bidirectional and non-bidirectional replication cases.
> > >
> > > We could introduce a boolean subscription option (retain_dead_tuples)
> > > to control whether this feature is enabled. Each subscription intending
> > > to detect update-delete conflicts should set retain_dead_tuples to
> > > true.
> > >
>
> I'm still studying this idea but let me confirm the following scenario.
>
> Suppose both Node-A and Node-B have the same row (1,1) in table t, and
> XIDs and commit LSNs of T2 and T3 are the following:
>
> Node A
> T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, commit-LSN:1000
>
> Node B
> T3: UPDATE t SET value = 2 WHERE id = 1 (10:01 AM) XID:500,
> commit-LSN:5000
>
> Further suppose that it's now 10:05 AM, and the latest XID and the
> latest flush WAL position of Node-A and Node-B are the following:
>
> Node A
> current XID: 300
> latest flush LSN: 3000
>
> Node B
> current XID: 700
> latest flush LSN: 7000
>
> Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> (i.e., the logical replication is delayed by 5 min).
>
> Consider the following scenario:
>
> 1. The apply worker on Node-A calls GetRunningTransactionData() and
> gets 301 (set as candidate_xmin).
> 2. The apply worker on Node-A requests the latest WAL flush position
> from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> 3. T2 is applied on Node-B, and the latest flush position of Node-B is now 8000.
> 4. The apply worker on Node-A continues applying changes, and applies
> the transactions up to remote (commit) LSN 7100.
> 5. Now that the apply worker on Node-A applied all changes smaller
> than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> 301 (candidate_xmin).
> 6. On Node-A, vacuum runs and physically removes the tuple that was
> deleted by T2.
>
> Here, on Node-B, there might be a transaction between LSN 7100 and 8000
> that might require the tuple that is deleted by T2.
>
> For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> On Node-A, whether we detect "update_deleted" or "update_missing"
> still depends on when vacuum removes the tuple deleted by T2.

I think in this case, no matter whether we detect "update_deleted" or
"update_missing", the final data is the same. Because T4's commit timestamp
should be later than T2's on Node-A, in the case of "update_deleted" we will
compare the commit timestamp of the deleted tuple's xmax with T4's timestamp,
and T4 should win, which means we will convert the UPDATE into an INSERT and
apply it. Even if the dead tuple has already been removed and "update_missing"
is detected, the UPDATE will still be converted into an INSERT and applied.
So, the result is the same.
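
The following is only a sketch of that resolution argument (the helper names
are hypothetical, not patch code): with last-commit-timestamp-wins resolution,
both conflict kinds lead to the same applied result.

/*
 * Illustrative sketch only: whether the conflict is reported as
 * update_deleted or update_missing, the remote UPDATE (T4) is applied
 * as an INSERT because its commit timestamp is newer than the local
 * DELETE's (T2).
 */
static void
resolve_remote_update(bool found_dead_tuple,
                      TimestampTz local_delete_ts,   /* T2's commit ts */
                      TimestampTz remote_update_ts)  /* T4's commit ts */
{
    if (found_dead_tuple)
    {
        /* update_deleted: compare the dead tuple's xmax commit ts with T4's */
        if (remote_update_ts > local_delete_ts)
            apply_update_as_insert();   /* T4 wins */
        else
            skip_remote_update();       /* local DELETE wins */
    }
    else
    {
        /* update_missing: no tuple found, convert the UPDATE to an INSERT */
        apply_update_as_insert();
    }
}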

>
> If applying T4 raises an "update_missing" (i.e. the changes are
> applied in the order of T2->T3->(vacuum)->T4), it is converted into an
> insert, resulting in the table having a row with value = 3.
>
> If applying T4 raises an "update_deleted" (i.e. the changes are
> applied in the order of T2->T3->T4->(vacuum)), it's skipped, resulting
> in the table having no row.
>
> On the other hand, in this scenario, Node-B applies changes in the
> order of T3->T4->T2, and applying T2 raises a "delete_origin_differ",
> resulting in the table having a row with val=3 (assuming
> latest_committs_win is the default resolver for this conflict).
>
> Please confirm this scenario as I might be missing something.

As explained above, I think the data can be consistent in this case as well.

Best Regards,
Hou zj
