From: | Konstantin Osipov <kostja(dot)osipov(at)gmail(dot)com> |
---|---|
To: | Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> |
Cc: | Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Sabino Mullane <htamfids(at)gmail(dot)com>, Nikolay Samokhvalov <nik(at)postgres(dot)ai>, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Built-in Raft replication |
Date: | 2025-04-16 09:53:09 |
Message-ID: | Z_9-BR89w-DLeFv3@ark |
Lists: | pgsql-hackers |
* Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> [25/04/16 11:06]:
> > My view is what Konstantin wants is automatic replication topology management. For some reason this technology is called HA, DCS, Raft, Paxos and many other scary words. But basically it manages primary_conn_info of some nodes to provide some fault-tolerance properties. I'd start to design from here, not from Raft paper.
> >
> In my experience, the load of managing hundreds of replicas which all
> participate in RAFT protocol becomes more than regular transaction
> load. So making every replica a RAFT participant will affect the
> ability to deploy hundreds of replica.
I think this experience needs to be described in more detail.
Some implementations in the field are less efficient than others.
Early etcd-raft didn't have pre-voting and had a "bastardized"
(their own definition) implementation of configuration changes
which didn't use joint consensus.
Then there is a liveness issue if leader election is implemented
in a straightforward way in large clusters. But this can be
addressed by scaling up the randomized election timeout with the
cluster size and by converting most of the participants to
non-voters in large clusters.
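
To make the two mitigations above concrete, here is a minimal
sketch. It is not taken from any existing implementation: the
constants BASE_TIMEOUT_MS and MAX_VOTERS, the linear scaling rule,
and both function names are assumptions chosen for illustration.

```python
import random

BASE_TIMEOUT_MS = 150  # assumed base election timeout, not a recommended value
MAX_VOTERS = 5         # assumed cap on the number of voting members


def election_timeout_ms(num_voters, rng=random):
    """Return a randomized timeout whose spread grows with the voter count.

    A wider spread makes it less likely that two candidates time out
    at nearly the same moment and split the vote.
    """
    spread = BASE_TIMEOUT_MS * max(1, num_voters)
    return BASE_TIMEOUT_MS + rng.uniform(0, spread)


def assign_roles(node_ids, max_voters=MAX_VOTERS):
    """Make the first max_voters nodes (in sorted order) voters.

    All remaining nodes become non-voters: they receive the log but
    do not take part in elections or count towards the quorum.
    """
    ordered = sorted(node_ids)
    return {
        node: ("voter" if i < max_voters else "non-voter")
        for i, node in enumerate(ordered)
    }
```

With a few hundred replicas, assign_roles() keeps the quorum at a
handful of nodes, so the election cost no longer depends on the
total number of replicas.
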
Raft replication, again, if implemented in a naive way, would
require O(outstanding transactions * number of replicas) of
RAM. But it doesn't have to be naive.
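
One possible way to avoid the per-replica multiplication is to
keep a single in-memory copy of the outstanding entries and track
only an integer position per replica. The sketch below is an
illustration under that assumption; the SharedLog class and its
methods are hypothetical and not part of any proposal in this
thread.

```python
class SharedLog:
    """One shared buffer of outstanding entries plus a per-replica position.

    Memory is O(outstanding entries + number of replicas) instead of
    O(outstanding entries * number of replicas).
    """

    def __init__(self, replica_ids):
        self.entries = []  # single copy of the outstanding entries
        self.base = 0      # number of entries already dropped from the front
        self.acked = {r: 0 for r in replica_ids}  # entries acknowledged so far

    def append(self, payload):
        """Add an entry and return its 1-based log index."""
        self.entries.append(payload)
        return self.base + len(self.entries)

    def pending(self, replica):
        """Return the entries this replica has not acknowledged yet."""
        return self.entries[self.acked[replica] - self.base:]

    def ack(self, replica, index):
        """Record that the replica has everything up to and including index."""
        self.acked[replica] = max(self.acked[replica], index)
        # Drop the prefix that every replica has acknowledged.
        low = min(self.acked.values())
        drop = low - self.base
        if drop > 0:
            del self.entries[:drop]
            self.base = low
```

In this sketch a single slow replica still pins the buffer; a
real implementation would presumably bound the buffer and make a
lagging replica catch up from disk instead.
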
To sum up, I am not aware of any fundamental limitations in this
area.
--
Konstantin Osipov, Moscow, Russia