Re: Conflict detection for update_deleted in logical replication

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Conflict detection for update_deleted in logical replication
Date: 2024-09-24 06:42:15
Message-ID: CAD21AoDpfmtkePDYUGqD=-JG9vK16aYocgD=2K+zCracUP0O2A@mail.gmail.com
Lists: pgsql-hackers

On Mon, Sep 23, 2024 at 8:32 PM Zhijie Hou (Fujitsu)
<houzj(dot)fnst(at)fujitsu(dot)com> wrote:
>
> On Tuesday, September 24, 2024 5:05 AM Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> wrote:
> >
> > Thank you for considering another idea.
>
> Thanks for reviewing the idea!
>
> >
> > On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> > > <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> > > >
> > > > Apart from the vacuum_defer_cleanup_age idea.
> > > >
> > >
> > > I think you meant to say vacuum_committs_age idea.
> > >
> > > > we’ve given more thought to our approach for retaining dead tuples and
> > > > have come up with another idea that can reliably detect conflicts
> > > > without requiring users to choose a wise value for the
> > > > vacuum_committs_age. This new idea could also reduce the performance
> > > > impact. Thanks a lot to Amit for off-list discussion.
> > > >
> > > > The concept of the new idea is that the dead tuples are only useful to
> > > > detect conflicts when applying *concurrent* transactions from remotes.
> > > > Any subsequent UPDATE from a remote node after removing the dead tuples
> > > > should have a later timestamp, meaning it's reasonable to detect an
> > > > update_missing scenario and convert the UPDATE to an INSERT when
> > > > applying it.
> > > >
> > > > To achieve the above, we can create an additional replication slot on
> > > > the subscriber side, maintained by the apply worker. This slot is used
> > > > to retain the dead tuples. The apply worker will advance the slot.xmin
> > > > after confirming that all the concurrent transactions on the publisher
> > > > have been applied locally.
> >
> > Will the replication slot used for this purpose be a physical one or a
> > logical one? And IIUC such a slot doesn't need to retain WAL, but if it
> > did, how would we advance the LSN of the slot?
>
> I think it would be a logical slot. We can keep the
> restart_lsn/confirmed_flush_lsn invalid because we don't need to retain
> WAL for decoding purposes.
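
If the slot only pins xmin and keeps restart_lsn invalid, here is a
minimal sketch of how I imagine the apply worker would advance the
slot's xmin (the function name and the two candidate_* variables are
hypothetical, not from any posted patch, and the locking around
effective_xmin is simplified):

#include "postgres.h"
#include "access/transam.h"
#include "access/xlogdefs.h"
#include "replication/slot.h"
#include "storage/spin.h"

/* Hypothetical state the apply worker keeps between advancement cycles. */
static TransactionId candidate_xmin = InvalidTransactionId;
static XLogRecPtr candidate_remote_wal_lsn = InvalidXLogRecPtr;

/*
 * Once everything up to candidate_remote_wal_lsn has been applied
 * locally, advance the slot's xmin so vacuum may remove older dead
 * tuples.
 */
static void
maybe_advance_conflict_slot_xmin(XLogRecPtr last_applied_remote_lsn)
{
	if (!TransactionIdIsValid(candidate_xmin) ||
		last_applied_remote_lsn < candidate_remote_wal_lsn)
		return;

	SpinLockAcquire(&MyReplicationSlot->mutex);
	MyReplicationSlot->data.xmin = candidate_xmin;
	MyReplicationSlot->effective_xmin = candidate_xmin;
	SpinLockRelease(&MyReplicationSlot->mutex);

	/* Recompute the global xmin horizon and persist the new value. */
	ReplicationSlotsComputeRequiredXmin(false);
	ReplicationSlotMarkDirty();
	ReplicationSlotSave();

	candidate_xmin = InvalidTransactionId;
}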
>
> >
> > > > 2) The apply worker sends a new message to the walsender to request the
> > > > latest WAL flush position (GetFlushRecPtr) on the publisher, and saves
> > > > it to 'candidate_remote_wal_lsn'. Here we could introduce a new feedback
> > > > message or extend the existing keepalive message (e.g., extend the
> > > > requestReply bit in the keepalive message to add a
> > > > 'request_wal_position' value).
> >
> > The apply worker sends a keepalive message when it hasn't received
> > anything for more than wal_receiver_timeout / 2. So in a very active
> > system, we cannot rely on piggybacking new information onto the
> > keepalive messages to get the latest remote flush LSN.
>
> Right. I think we need to send this new message at some interval independent of
> wal_receiver_timeout.
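
Agreed that it needs its own cadence. A rough sketch of the timing I
have in mind (send_wal_position_request() and the 1-second interval are
placeholders I made up):

#include "postgres.h"
#include "utils/timestamp.h"

/* Hypothetical message-sending routine for the proposed new request. */
extern void send_wal_position_request(void);

/* Hypothetical cadence, deliberately unrelated to wal_receiver_timeout. */
#define WAL_POSITION_REQUEST_INTERVAL_MS 1000

static TimestampTz last_request_time = 0;

static void
maybe_request_remote_flush_lsn(void)
{
	TimestampTz now = GetCurrentTimestamp();

	if (!TimestampDifferenceExceeds(last_request_time, now,
									WAL_POSITION_REQUEST_INTERVAL_MS))
		return;

	send_wal_position_request();
	last_request_time = now;
}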
>
> >
> > > > 3) The apply worker can continue to apply changes. After applying all
> > > > the WALs up to 'candidate_remote_wal_lsn', the apply worker can then
> > > > advance the slot.xmin to 'candidate_xmin'.
> > > >
> > > > This approach ensures that dead tuples are not removed until all
> > > > concurrent transactions have been applied. It can be effective for both
> > > > bidirectional and non-bidirectional replication cases.
> > > >
> > > > We could introduce a boolean subscription option (retain_dead_tuples)
> > > > to control whether this feature is enabled. Each subscription intending
> > > > to detect update-delete conflicts should set retain_dead_tuples to
> > > > true.
> > > >
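As a side note on the interface: assuming the option keeps this name, I
imagine enabling it would look something like "CREATE SUBSCRIPTION sub
... WITH (retain_dead_tuples = true)" (hypothetical syntax, since the
option doesn't exist yet).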
> >
> > I'm still studying this idea but let me confirm the following scenario.
> >
> > Suppose both Node-A and Node-B have the same row (1,1) in table t, and
> > XIDs and commit LSNs of T2 and T3 are as follows:
> >
> > Node A
> > T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, commit-LSN:1000
> >
> > Node B
> > T3: UPDATE t SET value = 2 WHERE id = 1 (10:01 AM) XID:500, commit-LSN:5000
> >
> > Further suppose that it's now 10:05 AM, and the latest XID and the
> > latest flush WAL position of Node-A and Node-B are as follows:
> >
> > Node A
> > current XID: 300
> > latest flush LSN: 3000
> >
> > Node B
> > current XID: 700
> > latest flush LSN: 7000
> >
> > Both T2 and T3 are NOT sent to Node B and Node A yet, respectively
> > (i.e., logical replication is lagging by 5 min).
> >
> > Consider the following scenario:
> >
> > 1. The apply worker on Node-A calls GetRunningTransactionData() and
> > gets 301 (set as candidate_xmin); see the sketch after this scenario.
> > 2. The apply worker on Node-A requests the latest WAL flush position
> > from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
> > 3. T2 is applied on Node-B, and the latest flush position of Node-B is now 8000.
> > 4. The apply worker on Node-A continues applying changes, and applies
> > the transactions up to remote (commit) LSN 7100.
> > 5. Now that the apply worker on Node-A applied all changes smaller
> > than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
> > 301 (candidate_xmin).
> > 6. On Node-A, vacuum runs and physically removes the tuple that was
> > deleted by T2.
> >
> > Here, on Node-B, there might be a transaction between LSN 7100 and 8000
> > that requires the tuple deleted by T2.
> >
> > For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
> > executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
> > On Node-A, whether we detect "update_deleted" or "update_missing"
> > still depends on when vacuum removes the tuple deleted by T2.
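
For reference, here is roughly how I picture step 1 above; this is only
a sketch of calling GetRunningTransactionData() (which returns with
ProcArrayLock and XidGenLock held), not code from a patch:

#include "postgres.h"
#include "storage/lwlock.h"
#include "storage/procarray.h"
#include "storage/standby.h"

static TransactionId
capture_candidate_xmin(void)
{
	RunningTransactions running;
	TransactionId result;

	/* Returns with ProcArrayLock and XidGenLock held. */
	running = GetRunningTransactionData();
	result = running->nextXid;	/* 301 in the scenario above */

	LWLockRelease(ProcArrayLock);
	LWLockRelease(XidGenLock);

	return result;
}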
>
> I think in this case, no matter whether we detect "update_deleted" or
> "update_missing", the final data is the same. Because T4's commit
> timestamp should be later than T2's on Node-A, in the case of
> "update_deleted" we will compare the commit timestamp of the deleted
> tuple's xmax with T4's timestamp, and T4 should win, which means we will
> convert the update into an insert and apply it. Even if the dead tuple
> has already been removed and "update_missing" is detected, the update
> will still be converted into an insert and applied. So, the result is
> the same.
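
If I've understood, the comparison you describe would look roughly like
the sketch below, assuming a last-timestamp-wins resolution and
track_commit_timestamp = on (remote_update_wins() is a name I made up):

#include "postgres.h"
#include "access/commit_ts.h"
#include "replication/origin.h"
#include "utils/timestamp.h"

/*
 * Does a remote UPDATE (e.g. T4) beat the local DELETE (e.g. T2) that
 * left the dead tuple behind?  delete_xid is the xmax of the dead tuple
 * found for the replica identity key.
 */
static bool
remote_update_wins(TransactionId delete_xid, TimestampTz remote_commit_ts)
{
	TimestampTz local_commit_ts;
	RepOriginId origin;

	/* Needs track_commit_timestamp = on to look up the DELETE's time. */
	if (!TransactionIdGetCommitTsData(delete_xid, &local_commit_ts, &origin))
		return true;		/* no local timestamp: let the remote win */

	/* TimestampTz is int64 microseconds, so plain comparison works. */
	return remote_commit_ts > local_commit_ts;
}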

The "latest_timestamp_wins" is the default resolution method for
"update_deleted"? When I checked the wiki page[1], the "skip" was the
default solution method for that.

Regards,

[1] https://wiki.postgresql.org/wiki/Conflict_Detection_and_Resolution#Defaults

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
