RE: Conflict detection for update_deleted in logical replication

From: "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>
To: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>, Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: RE: Conflict detection for update_deleted in logical replication
Date: 2024-09-20 02:54:59
Message-ID: OS0PR01MB5716A78EE3D6F2F4754E1209946C2@OS0PR01MB5716.jpnprd01.prod.outlook.com
Lists: pgsql-hackers

> -----Original Message-----
> From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
> Sent: Friday, September 20, 2024 2:49 AM
> To: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> Cc: shveta malik <shveta(dot)malik(at)gmail(dot)com>; Hou, Zhijie/侯 志杰
> <houzj(dot)fnst(at)fujitsu(dot)com>; pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
> Subject: Re: Conflict detection for update_deleted in logical replication
>
> On Tue, Sep 17, 2024 at 9:29 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
> wrote:
> >
> > On Tue, Sep 17, 2024 at 11:24 PM Masahiko Sawada
> <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > >
> > > On Mon, Sep 16, 2024 at 11:53 PM Amit Kapila
> <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > > >
> > > > On Tue, Sep 17, 2024 at 6:08 AM Masahiko Sawada
> <sawada(dot)mshk(at)gmail(dot)com> wrote:
> > > >
> > > > I haven't thought about the implementation details yet but I think
> > > > during pruning (for example in heap_prune_satisfies_vacuum()),
> > > > apart from checking if the tuple satisfies
> > > > HeapTupleSatisfiesVacuumHorizon(), we should also check if the
> > > > tuple's committs is greater than configured vacuum_committs_age
> > > > (for the
> > > > table) to decide whether tuple can be removed.
> > >
> > > Sounds very costly. I think we need to do performance tests. Even if
> > > the vacuum gets slower only on the particular table having the
> > > vacuum_committs_age setting, it would affect overall autovacuum
> > > performance. Also, it would affect HOT pruning performance.
> > >
> >
> > Agreed that we should do some performance testing and additionally
> > think of any better way to implement. I think the cost won't be much
> > if the tuples to be removed are from a single transaction because the
> > required commit_ts information would be cached but when the tuples are
> > from different transactions, we could see a noticeable impact. We need
> > to test to say anything concrete on this.
>
> Agreed.
>
> >
> > > >
> > > > > > but IIUC this value doesn’t need to be significant; it can be
> > > > > > limited to just a few minutes. The one which is sufficient to
> > > > > > handle replication delays caused by network lag or other
> > > > > > factors, assuming clock skew has already been addressed.
> > > > >
> > > > > I think that in a non-bidirectional case the value could need to
> > > > > be a large number. Is that right?
> > > > >
> > > >
> > > > As per my understanding, even for non-bidirectional cases, the
> > > > value should be small. For example, in the case, pointed out by
> > > > Shveta [1], where the updates from 2 nodes are received by a third
> > > > node, this setting is expected to be small. This setting primarily
> > > > deals with concurrent transactions on multiple nodes, so it should
> > > > be small but I could be missing something.
> > > >
> > >
> > > I might be missing something but the scenario I was thinking of is
> > > something below.
> > >
> > > Suppose that we setup uni-directional logical replication between
> > > Node A and Node B (e.g., Node A -> Node B) and both nodes have the
> > > same row with key = 1:
> > >
> > > Node A:
> > > T1: UPDATE t SET val = 2 WHERE key = 1; (10:00 AM)
> > > -> This change is applied on Node B at 10:01 AM.
> > >
> > > Node B:
> > > T2: DELETE FROM t WHERE key = 1; (05:00 AM)
> > >
> > > If a vacuum runs on Node B at 06:00 AM, the change of T1 coming from
> > > Node A would raise an "update_missing" conflict. On the other hand,
> > > if a vacuum runs on Node B at 11:00 AM, the change would raise an
> > > "update_deleted" conflict. It looks whether we detect an
> > > "update_deleted" or an "updated_missing" depends on the timing of
> > > vacuum, and to avoid such a situation, we would need to set
> > > vacuum_committs_age to more than 5 hours.
> > >
> >
> > Yeah, in this case, it would detect a different conflict (if we don't
> > set vacuum_committs_age to greater than 5 hours) but as per my
> > understanding, the primary purpose of conflict detection and
> > resolution is to avoid data inconsistency in a bi-directional setup.
> > Assume, in the above case it is a bi-directional setup, then we want
> > to have the same data in both nodes. Now, if there are other cases
> > like the one you mentioned that require to detect the conflict
> > reliably than I agree this value could be large and probably not the
> > best way to achieve it. I think we can mention in the docs that the
> > primary purpose of this is to achieve data consistency among
> > bi-directional kind of setups.
> >
> > Having said that even in the above case, the result should be the same
> > whether the vacuum has removed the row or not. Say, if the vacuum has
> > not yet removed the row (due to vacuum_committs_age or otherwise) then
> > also because the incoming update has a later timestamp, we will
> > convert the update to insert as per last_update_wins resolution
> > method, so the conflict will be considered as update_missing. And,
> > say, the vacuum has removed the row and the conflict detected is
> > update_missing, then also we will convert the update to insert. In
> > short, if UPDATE has lower commit-ts, DELETE should win and if UPDATE
> > has higher commit-ts, UPDATE should win.
> >
> > So, we can expect data consistency in bidirectional cases and expect a
> > deterministic behavior in other cases (e.g. the final data in a table
> > does not depend on the order of applying the transactions from other
> > nodes).
>
> Agreed.
>
> I think that such a time-based configuration parameter would be a reasonable
> solution. The current concerns are that it might affect vacuum performance and
> lead to a similar bug we had with vacuum_defer_cleanup_age.

Thanks for the feedback!

I am working on the POC patch and doing some initial performance tests on this idea.
I will share the results after finishing.

Apart from the vacuum_defer_cleanup_age idea, we’ve given more thought to our
approach for retaining dead tuples and have come up with another idea that can
reliably detect conflicts without requiring users to choose a suitable value for
vacuum_committs_age. This new idea could also reduce the performance impact.
Thanks a lot to Amit for the off-list discussion.

The core of the new idea is that dead tuples are only useful for detecting
conflicts when applying *concurrent* transactions from remote nodes. Any
UPDATE from a remote node that arrives after the dead tuples have been removed
should have a later timestamp, so it is reasonable to detect an update_missing
scenario and convert the UPDATE to an INSERT when applying it.

To achieve the above, we can create an additional replication slot on the
subscriber side, maintained by the apply worker. This slot is used to retain
the dead tuples. The apply worker will advance the slot.xmin only after
confirming that all concurrent transactions on the publisher have been applied
locally.

The process of advancing the slot.xmin could be as follows (a rough code
sketch is given after the list):

1) The apply worker calls GetRunningTransactionData() to get the
'oldestRunningXid' and considers this the 'candidate_xmin'.
2) The apply worker sends a new message to the walsender to request the latest
WAL flush position (GetFlushRecPtr()) on the publisher, and saves it as
'candidate_remote_wal_lsn'. Here we could introduce a new feedback message or
extend the existing keepalive message (e.g., extend the requestReply bit in the
keepalive message to add a 'request_wal_position' value).
3) The apply worker continues to apply changes. After applying all the WAL up
to 'candidate_remote_wal_lsn', it can then advance the slot.xmin to
'candidate_xmin'.
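
To make the steps above more concrete, below is a rough C sketch of how the
apply worker could drive this. GetRunningTransactionData() and GetFlushRecPtr()
are existing backend functions, but request_publisher_flush_lsn() and
advance_conflict_slot_xmin() are invented placeholders for the new feedback
message and the slot manipulation, so please treat this only as an illustration
of the idea, not as the actual patch:

/*
 * Rough sketch only -- request_publisher_flush_lsn() and
 * advance_conflict_slot_xmin() are placeholders for the new feedback
 * message and the slot manipulation; they do not exist today.
 */
#include "postgres.h"

#include "access/transam.h"
#include "access/xlogdefs.h"
#include "storage/lwlock.h"
#include "storage/procarray.h"
#include "storage/standby.h"

static TransactionId candidate_xmin = InvalidTransactionId;
static XLogRecPtr candidate_remote_wal_lsn = InvalidXLogRecPtr;

/* placeholder helpers, not existing APIs */
extern XLogRecPtr request_publisher_flush_lsn(void);
extern void advance_conflict_slot_xmin(TransactionId new_xmin);

/*
 * Steps 1 and 2: pick a candidate xmin and remember the publisher's
 * flush position at roughly the same moment.
 */
static void
begin_xmin_advancement(void)
{
	RunningTransactions running;

	/* Step 1: oldest transaction still running on the subscriber */
	running = GetRunningTransactionData();
	candidate_xmin = running->oldestRunningXid;

	/* GetRunningTransactionData() leaves these locks held for the caller */
	LWLockRelease(ProcArrayLock);
	LWLockRelease(XidGenLock);

	/*
	 * Step 2: ask the walsender for GetFlushRecPtr() on the publisher,
	 * e.g. via a new feedback message or an extended keepalive.
	 */
	candidate_remote_wal_lsn = request_publisher_flush_lsn();
}

/*
 * Step 3: call this after each remote transaction has been applied. Once
 * everything up to candidate_remote_wal_lsn has been applied locally, all
 * transactions concurrent with the candidate have arrived, so the slot's
 * xmin can be moved forward and older dead tuples become removable.
 */
static void
maybe_finish_xmin_advancement(XLogRecPtr last_applied_remote_lsn)
{
	if (TransactionIdIsValid(candidate_xmin) &&
		last_applied_remote_lsn >= candidate_remote_wal_lsn)
	{
		advance_conflict_slot_xmin(candidate_xmin);
		candidate_xmin = InvalidTransactionId;
	}
}

The point of the two-phase structure is that the candidate xmin is only
applied to the slot once the local apply position has passed the publisher's
flush position that was captured together with that candidate.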

This approach ensures that dead tuples are not removed until all concurrent
transactions have been applied. It can be effective for both bidirectional and
non-bidirectional replication cases.

We could introduce a boolean subscription option (retain_dead_tuples) to
control whether this feature is enabled. Each subscription intending to detect
update_deleted conflicts should set retain_dead_tuples to true.

The following explains how it works in different cases to achieve data
consistency:

--
2 nodes, bidirectional case 1:
--
Node A:
T1: INSERT INTO t (id, value) VALUES (1,1); ts=10.00 AM
T2: DELETE FROM t WHERE id = 1; ts=10.02 AM

Node B:
T3: UPDATE t SET value = 2 WHERE id = 1; ts=10.01 AM

subscription retain_dead_tuples = true/false

After executing T2, the apply worker on Node A will check the latest WAL flush
location on Node B. By that time, T3 should have finished, so the xmin will be
advanced only after applying the WAL that is later than T3. So, the dead tuple
will not be removed before applying T3, which means the update_deleted conflict
can be detected.

--
2 nodes, bidirectional case 2:
--
Node A:
T1: INSERT INTO t (id, value) VALUES (1,1); ts=10.00 AM
T2: DELETE FROM t WHERE id = 1; ts=10.01 AM

Node B:
T3: UPDATE t SET value = 2 WHERE id = 1; ts=10.02 AM

After executing T2, the apply worker on Node A will request the latest WAL
flush location on Node B. At that point, T3 is either running concurrently or
has not started yet; in both cases, T3 must have a later timestamp. So, even if
the dead tuple is removed in this case and update_missing is detected, the
default resolution is to convert the UPDATE to an INSERT, which is OK because
the data remain consistent on Node A and Node B.

--
3 nodes, non-bidirectional, Node C subscribes to both Node A and Node B:
--

Node A:
T1: INSERT INTO t (id, value) VALUES (1,1); ts=10.00 AM
T2: DELETE FROM t WHERE id = 1; ts=10.01 AM

Node B:
T3: UPDATE t SET value = 2 WHERE id = 1; ts=10.02 AM

Node C:
apply T1, T2, T3

After applying T2, the apply worker on Node C will check the latest WAL flush
location on Node B. By that time, T3 should have finished, so the xmin will be
advanced only after applying the WAL that is later than T3. So, the dead tuple
will not be removed before applying T3, which means the update_deleted conflict
can be detected.

Your feedback on this idea would be greatly appreciated.

Best Regards,
Hou zj
