Re: Conflict detection for update_deleted in logical replication

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Conflict detection for update_deleted in logical replication
Date: 2024-09-23 21:05:17
Message-ID: CAD21AoCS84ofwi7OVzT7aEMvoE+gm3mZxuDVpDZw6OrccfbNkg@mail.gmail.com
Lists: pgsql-hackers

Hi,

Thank you for considering another idea.

On Fri, Sep 20, 2024 at 2:46 AM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
>
> On Fri, Sep 20, 2024 at 8:25 AM Zhijie Hou (Fujitsu)
> <houzj(dot)fnst(at)fujitsu(dot)com> wrote:
> >
> > Apart from the vacuum_defer_cleanup_age idea.
> >
>
> I think you meant to say vacuum_committs_age idea.
>
> > we’ve given more thought to our
> > approach for retaining dead tuples and have come up with another idea that can
> > reliably detect conflicts without requiring users to choose a wise value for
> > the vacuum_committs_age. This new idea could also reduce the performance
> > impact. Thanks a lot to Amit for off-list discussion.
> >
> > The concept of the new idea is that, the dead tuples are only useful to detect
> > conflicts when applying *concurrent* transactions from remotes. Any subsequent
> > UPDATE from a remote node after removing the dead tuples should have a later
> > timestamp, meaning it's reasonable to detect an update_missing scenario and
> > convert the UPDATE to an INSERT when applying it.
> >
> > To achieve above, we can create an additional replication slot on the
> > subscriber side, maintained by the apply worker. This slot is used to retain
> > the dead tuples. The apply worker will advance the slot.xmin after confirming
> > that all the concurrent transactions on the publisher have been applied locally.

Will the replication slot used for this purpose be a physical one or a
logical one? And IIUC such a slot doesn't need to retain WAL, but if it
does, how do we advance the LSN of the slot?

> > 2) the apply worker sends a new message to walsender to request the latest wal
> > flush position(GetFlushRecPtr) on publisher, and save it to
> > 'candidate_remote_wal_lsn'. Here we could introduce a new feedback message or
> > extend the existing keepalive message(e,g extends the requestReply bit in
> > keepalive message to add a 'request_wal_position' value)

The apply worker sends a keepalive message only when it hasn't received
anything for more than wal_receiver_timeout / 2. So in a very active
system, we cannot rely on piggybacking new information on the keepalive
messages to get the latest remote flush LSN.

> > 3) The apply worker can continue to apply changes. After applying all the WALs
> > up to 'candidate_remote_wal_lsn', the apply worker can then advance the
> > slot.xmin to 'candidate_xmin'.
> >
> > This approach ensures that dead tuples are not removed until all concurrent
> > transactions have been applied. It can be effective for both bidirectional and
> > non-bidirectional replication cases.
> >
> > We could introduce a boolean subscription option (retain_dead_tuples) to
> > control whether this feature is enabled. Each subscription intending to detect
> > update-delete conflicts should set retain_dead_tuples to true.
> >

I'm still studying this idea but let me confirm the following scenario.

Suppose both Node-A and Node-B have the same row (1,1) in table t, and
XIDs and commit LSNs of T2 and T3 are the following:

Node A
T2: DELETE FROM t WHERE id = 1 (10:02 AM) XID:100, commit-LSN:1000

Node B
T3: UPDATE t SET value = 2 WHERE id = 1 (10:01 AM) XID:500, commit-LSN:5000

Further suppose that it's now 10:05 AM, and the latest XID and the
latest flush WAL position of Node-A and Node-B are following:

Node A
current XID: 300
latest flush LSN: 3000

Node B
current XID: 700
latest flush LSN: 7000

Both T2 and T3 have NOT been sent to Node B and Node A yet, respectively
(i.e., the logical replication is delayed by 5 min).

Consider the following scenario:

1. The apply worker on Node-A calls GetRunningTransactionData() and
gets 301 (set as candidate_xmin).
2. The apply worker on Node-A requests the latest WAL flush position
from Node-B, and gets 7000 (set as candidate_remote_wal_lsn).
3. T2 is applied on Node-B, and the latest flush position of Node-B is now 8000.
4. The apply worker on Node-A continues applying changes, and applies
the transactions up to remote (commit) LSN 7100.
5. Now that the apply worker on Node-A applied all changes smaller
than candidate_remote_wal_lsn (7000), it increases the slot.xmin to
301 (candidate_xmin).
6. On Node-A, vacuum runs and physically removes the tuple that was
deleted by T2.

Here, on Node-B, there might be a transaction between LSN 7100 and 8000
that requires the tuple deleted by T2.

For example, "UPDATE t SET value = 3 WHERE id = 1" (say T4) is
executed on Node-B at LSN 7200, and it's sent to Node-A after step 6.
On Node-A, whether we detect "update_deleted" or "update_missing"
still depends on when vacuum removes the tuple deleted by T2.

If applying T4 raises an "update_missing" (i.e., the changes are
applied in the order of T2->T3->(vacuum)->T4), it is converted into an
INSERT, resulting in the table having a row with value = 3.

If applying T4 raises an "update_deleted" (i.e. the changes are
applied in the order of T2->T3->T4->(vacuum)), it's skipped, resulting
in the table having no row.

On the other hand, in this scenario, Node-B applies the changes in the
order of T3->T4->T2, and applying T2 raises a "delete_origin_differ",
resulting in the table having a row with value = 3 (assuming
latest_committs_win is the default resolver for this conflict).
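To make the ordering dependence concrete, here is a toy model of the Node-A side of this scenario. It only models the table, the dead tuple, and vacuum timing; the conflict names match the ones in this thread, and everything else is invented for illustration:

```python
# Toy model of the Node-A orderings above: whether T4 raises update_missing
# or update_deleted depends solely on whether vacuum removed the dead tuple
# before T4 arrived. Not PostgreSQL code; purely illustrative.

def apply_t4(table, dead_tuples):
    # T4: UPDATE t SET value = 3 WHERE id = 1, arriving from Node-B.
    if 1 in table:
        table[1] = 3
        return "apply"
    if 1 in dead_tuples:
        return "update_deleted"    # dead tuple still visible: skip the UPDATE
    return "update_missing"        # no trace left: convert UPDATE to INSERT

def run(vacuum_before_t4):
    table = {1: 1}                 # row (1,1)
    dead = {}
    dead[1] = table.pop(1)         # T2: DELETE FROM t WHERE id = 1
    if vacuum_before_t4:
        dead.clear()               # vacuum physically removes the dead tuple
    outcome = apply_t4(table, dead)
    if outcome == "update_missing":
        table[1] = 3               # converted into an INSERT
    return outcome, table

# T2->T3->(vacuum)->T4: the table ends up with a row with value = 3.
assert run(vacuum_before_t4=True) == ("update_missing", {1: 3})
# T2->T3->T4->(vacuum): T4 is skipped and the table ends up empty.
assert run(vacuum_before_t4=False) == ("update_deleted", {})
```

The two assertions show the divergent final states I describe above, driven only by vacuum timing.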

Please confirm this scenario as I might be missing something.

>
> As each apply worker needs a separate slot to retain deleted rows, the
> requirement for slots will increase. The other possibility is to
> maintain one slot by launcher or some other central process that
> traverses all subscriptions, remember the ones marked with
> retain_dead_rows (let's call this list as retain_sub_list). Then using
> running_transactions get the oldest running_xact, and then get the
> remote flush location from the other node (publisher node) and store
> those as candidate values (candidate_xmin and
> candidate_remote_wal_lsn) in slot. We can probably reuse existing
> candidate variables of the slot. Next, we can check the remote_flush
> locations from all the origins corresponding in retain_sub_list and if
> all are ahead of candidate_remote_wal_lsn, we can update the slot's
> xmin to candidate_xmin.

Does it mean that we use one candidate_remote_wal_lsn in a slot for all
subscriptions (in retain_sub_list)? IIUC candidate_remote_wal_lsn is an
LSN of one of the publishers, so other publishers could have completely
different LSNs. How do we compare the candidate_remote_wal_lsn to the
remote_flush locations from all the origins?
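For what it's worth, one way I could imagine making the comparison well-defined is to record a separate candidate flush LSN per origin, since LSNs from different publishers are not mutually comparable. A hypothetical sketch (all names invented), though I'm not sure this is what you had in mind:

```python
# Hypothetical sketch only: record one candidate flush LSN per origin
# (publisher), and advance slot.xmin only once every origin in
# retain_sub_list has been applied past its own candidate LSN.

def can_advance_xmin(candidate_lsn_per_origin, applied_lsn_per_origin):
    """True iff every origin's locally applied position has reached
    that origin's own candidate flush LSN."""
    return all(applied_lsn_per_origin.get(origin, 0) >= lsn
               for origin, lsn in candidate_lsn_per_origin.items())

# Two publishers with unrelated LSN spaces:
candidates = {"pub1": 7000, "pub2": 120}
assert can_advance_xmin(candidates, {"pub1": 7100, "pub2": 130}) is True
assert can_advance_xmin(candidates, {"pub1": 7100, "pub2": 100}) is False
```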

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
