Re: Conflict Detection and Resolution

From: shveta malik <shveta(dot)malik(at)gmail(dot)com>
To: Dilip Kumar <dilipbalaut(at)gmail(dot)com>
Cc: Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>, Tomas Vondra <tomas(dot)vondra(at)enterprisedb(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, "Zhijie Hou (Fujitsu)" <houzj(dot)fnst(at)fujitsu(dot)com>, Nisha Moond <nisha(dot)moond412(at)gmail(dot)com>, shveta malik <shveta(dot)malik(at)gmail(dot)com>
Subject: Re: Conflict Detection and Resolution
Date: 2024-06-18 09:58:51
Message-ID: CAJpy0uDCW+vrBoUZWrBWPjsM=9wwpwbpZuZa8Raj3VqeVYs3PQ@mail.gmail.com
Lists: pgsql-hackers

On Tue, Jun 18, 2024 at 11:34 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
>
> On Tue, Jun 18, 2024 at 10:17 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> >
> > On Mon, Jun 17, 2024 at 3:23 PM Amit Kapila <amit(dot)kapila16(at)gmail(dot)com> wrote:
> > >
> > > On Wed, Jun 12, 2024 at 10:03 AM Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote:
> > > >
> > > > On Tue, Jun 11, 2024 at 7:44 PM Tomas Vondra
> > > > <tomas(dot)vondra(at)enterprisedb(dot)com> wrote:
> > > >
> > > > > > Yes, that's correct. However, many cases could benefit from the
> > > > > > update_deleted conflict type if it can be implemented reliably. That's
> > > > > > why we wanted to give it a try. But if we can't achieve predictable
> > > > > > results with it, I'm fine to drop this approach and conflict_type. We
> > > > > > can consider a better design in the future that doesn't depend on
> > > > > > non-vacuumed entries and provides a more robust method for identifying
> > > > > > deleted rows.
> > > > > >
> > > > >
> > > > > I agree having a separate update_deleted conflict would be beneficial;
> > > > > I'm not arguing against that - my point is actually that I think this
> > > > > conflict type is required, and that it needs to be detected reliably.
> > > > >
> > > >
> > > > When working with a distributed system, we must accept some form of
> > > > eventual consistency model. However, it's essential to design a
> > > > predictable and acceptable behavior. For example, if a change is a
> > > > result of a previous operation (such as an update on node B triggered
> > > > after observing an operation on node A), we can say that the operation
> > > > on node A happened before the operation on node B. Conversely, if
> > > > operations on nodes A and B are independent, we consider them
> > > > concurrent.
> > > >
> > > > In distributed systems, clock skew is a known issue. To establish a
> > > > consistency model, we need to ensure it guarantees the
> > > > "happens-before" relationship. Consider a scenario with three nodes:
> > > > NodeA, NodeB, and NodeC. If NodeA sends changes to NodeB, and
> > > > subsequently NodeB makes changes, and then both NodeA's and NodeB's
> > > > changes are sent to NodeC, the clock skew might make NodeB's changes
> > > > appear to have occurred before NodeA's changes. However, we should
> > > > maintain data that indicates NodeB's changes were triggered after
> > > > NodeA's changes arrived at NodeB. This implies that logically, NodeB's
> > > > changes happened after NodeA's changes, despite what the timestamps
> > > > suggest.
> > > >
> > > > A common method to handle such cases is using vector clocks for
> > > > conflict resolution.
> > > >
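
To make the comparison concrete: with one counter per node, change A
"happened before" change B iff A's vector is component-wise <= B's and
smaller in at least one slot; if neither dominates the other, the two
changes are concurrent and need conflict resolution. A minimal sketch,
assuming a fixed membership of NUM_NODES nodes (illustrative C only,
not PostgreSQL code; the unbounded-size concern below is exactly this
fixed-membership assumption breaking down):

#include <stdbool.h>
#include <stdint.h>

#define NUM_NODES 3             /* assumed fixed membership: NodeA/B/C */

typedef struct VectorClock
{
    uint64_t counters[NUM_NODES];   /* one logical counter per node */
} VectorClock;

/* a happened before b iff a <= b component-wise and a != b */
static bool
vc_happened_before(const VectorClock *a, const VectorClock *b)
{
    bool strictly_less = false;

    for (int i = 0; i < NUM_NODES; i++)
    {
        if (a->counters[i] > b->counters[i])
            return false;
        if (a->counters[i] < b->counters[i])
            strictly_less = true;
    }
    return strictly_less;
}

/* neither ordering holds => concurrent, i.e. a genuine conflict */
static bool
vc_concurrent(const VectorClock *a, const VectorClock *b)
{
    return !vc_happened_before(a, b) && !vc_happened_before(b, a);
}
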
> > >
> > > I think storing a vector of unbounded size for each event could be a
> > > problem. However, while researching previous discussions, it
> > > came to our notice that we have discussed this topic in the past as
> > > well in the context of standbys. For recovery_min_apply_delay, we
> > > decided the clock skew is not a problem as the settings of this
> > > parameter are much larger than typical time deviations between servers
> > > as mentioned in docs. Similarly for causal reads [1], there was a
> > > proposal to introduce a max_clock_skew parameter along with the
> > > suggestion that the user make sure NTP is set up correctly. We have tried to check
> > > other databases (like Ora and BDR) where CDR is implemented but didn't
> > > find anything specific to clock skew. So, I propose to go with a GUC
> > > like max_clock_skew such that if the difference of time between the
> > > incoming transaction's commit time and the local time is more than
> > > max_clock_skew then we raise an ERROR. It is not clear to me that
> > > putting a bigger effort into clock skew is worthwhile, especially when
> > > other systems that have provided the CDR feature (like Ora or BDR) for
> > > decades have not done anything like vector clocks. It is possible that
> > > this is less of a problem w.r.t. CDR and just detecting the anomaly in
> > > clock skew is good enough.
> >
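
If I'm reading the proposal right, the check itself would be tiny. A
rough sketch against the existing timestamp APIs (the GUC, function
name, and error wording here are all made up for illustration, not
from any patch):

#include "postgres.h"
#include "utils/timestamp.h"

/* hypothetical GUC: max permitted skew in seconds, -1 disables it */
int max_clock_skew = -1;

static void
check_clock_skew(TimestampTz remote_commit_ts)
{
    TimestampTz local_ts = GetCurrentTimestamp();
    long        secs;
    int         microsecs;

    if (max_clock_skew < 0 || remote_commit_ts <= local_ts)
        return;                 /* disabled, or remote clock not ahead */

    TimestampDifference(local_ts, remote_commit_ts, &secs, &microsecs);

    if (secs > max_clock_skew ||
        (secs == max_clock_skew && microsecs > 0))
        ereport(ERROR,
                (errmsg("clock skew of %ld s between remote commit "
                        "timestamp and local time exceeds max_clock_skew",
                        secs)));
}
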
> > I believe that if we've accepted this solution elsewhere, then we can
> > also consider the same. Basically, we're allowing the application to
> > set its tolerance for clock skew. And, if the skew exceeds that
> > tolerance, it's the application's responsibility to synchronize;
> > otherwise, an error will occur. This approach seems reasonable.
>
> This model can be further extended by making the apply worker wait if
> the remote transaction's commit_ts is greater than the local
> timestamp. This ensures that no local transactions occurring after the
> remote transaction appear to have happened earlier due to clock skew;
> instead, we make them happen before the remote transaction by delaying
> the apply of the remote transaction. Essentially, by having the apply
> of the remote transaction wait until the local timestamp matches the
> remote transaction's timestamp, we ensure that the remote transaction,
> which seems to occur after concurrent local transactions due to clock
> skew, is actually applied after those transactions.
>
> With this model, there should be no ordering errors from the
> application's perspective either, provided synchronous commit is enabled.
> The transaction initiated by the publisher cannot be completed until
> it is applied to the synchronous subscriber. This ensures that if the
> subscriber's clock is lagging behind the publisher's clock, the
> transaction will not be applied until the subscriber's local clock is
> in sync, preventing the transaction from being completed out of order.
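
If I picture the apply-side of this wait, it is roughly the following
loop (a sketch only; the function, the GUC, and the reused wait event
are illustrative, not an actual patch):

#include "postgres.h"
#include "storage/latch.h"
#include "utils/timestamp.h"
#include "utils/wait_event.h"

/* delay applying a remote change until the local clock reaches
 * (remote commit_ts - max_clock_skew) */
static void
wait_for_local_clock(TimestampTz remote_commit_ts, int max_clock_skew_secs)
{
    TimestampTz wait_until =
        remote_commit_ts - (TimestampTz) max_clock_skew_secs * USECS_PER_SEC;

    while (GetCurrentTimestamp() < wait_until)
    {
        /* sleep in short slices so we can still react to shutdown */
        (void) WaitLatch(MyLatch,
                         WL_LATCH_SET | WL_TIMEOUT | WL_EXIT_ON_PM_DEATH,
                         1000L /* ms */ ,
                         WAIT_EVENT_RECOVERY_APPLY_DELAY);
        ResetLatch(MyLatch);
    }
}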

I tried to work out a few scenarios with this, where the apply worker
will wait until its local clock hits 'remote's commit_tts -
max_clock_skew'. Please have a look.

Let's say we have a GUC, max_clock_skew, to configure the maximum
permitted clock skew. The resolver is last_update_wins in both cases.
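
(For the resolver itself, last_update_wins just means something like
the below, reusing TimestampTz from above; treating a timestamp tie as
a remote win is an arbitrary choice in this sketch:)

/* keep whichever change carries the later commit timestamp */
static bool
remote_change_wins(TimestampTz local_commit_ts, TimestampTz remote_commit_ts)
{
    return remote_commit_ts >= local_commit_ts;
}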

----------------
1) Case 1: max_clock_skew set to 0, i.e. no tolerance for clock skew.

Remote update arrives with commit_timestamp = 10.20AM.
Local clock (which is, say, 5 min behind) shows 10.15AM.

When the remote update arrives at the local node, we see that the skew
is greater than max_clock_skew, and thus the apply worker waits till
the local clock hits 'remote's commit_tts - max_clock_skew', i.e. till
10.20AM. Once the local clock hits 10.20AM, the worker applies the
remote change with a commit_tts of 10.20AM. In the meantime (during
the apply worker's wait period), if some local update on the same row
happens at, say, 10.18AM, it will be applied first and later
overwritten by the above remote change of 10.20AM, as the remote
change's timestamp appears later, even though the remote change
actually happened earlier than the local change.

2) Case 2: max_clock_skew is set to 2 min.

Remote update arrives with commit_timestamp = 10.20AM.
Local clock (which is, say, 5 min behind) shows 10.15AM.

Now the apply worker will notice a skew greater than 2 min and thus
will wait till the local clock hits 'remote's commit_tts -
max_clock_skew', i.e. 10.18AM, and will apply the change with a
commit_tts of 10.20AM (as we always save the origin's commit timestamp
into the local commit_tts, see
RecordTransactionCommit->TransactionTreeSetCommitTsData). Now let's
say another local update is triggered at 10.19AM; it will be applied
locally but will be ignored on the remote node. On the remote node,
the existing change with a timestamp of 10.20AM will win, resulting in
data divergence.
----------
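
Just to double-check the arithmetic in both cases, a trivial
standalone snippet (times reduced to minutes past 10:00):

#include <stdio.h>

int main(void)
{
    int remote_commit = 20;             /* commit_tts 10.20AM */
    int local_clock = 15;               /* local clock shows 10.15AM */
    int skews[] = {0, 2};               /* Case 1 and Case 2 */

    for (int i = 0; i < 2; i++)
    {
        int wait_until = remote_commit - skews[i];

        printf("Case %d: wait till 10:%02d, i.e. %d min of waiting\n",
               i + 1, wait_until, wait_until - local_clock);
    }
    return 0;
}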

In Case 1, the local change, which was actually triggered later than
the remote change, is overwritten by the remote change. And in Case 2,
it results in data divergence. Is this behaviour expected in both
cases? Or am I getting the wait logic wrong? Thoughts?

thanks
Shveta
