Re: Built-in Raft replication

From:	Konstantin Osipov <kostja(dot)osipov(at)gmail(dot)com>
To:	Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>
Cc:	Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Sabino Mullane <htamfids(at)gmail(dot)com>, Nikolay Samokhvalov <nik(at)postgres(dot)ai>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: Built-in Raft replication
Date:	2025-04-16 09:53:09
Message-ID:	Z_9-BR89w-DLeFv3@ark
Lists:	pgsql-hackers

* Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com> [25/04/16 11:06]:
> > My view is what Konstantin wants is automatic replication topology management. For some reason this technology is called HA, DCS, Raft, Paxos and many other scary words. But basically it manages primary_conn_info of some nodes to provide some fault-tolerance properties. I'd start to design from here, not from Raft paper.
> >
> In my experience, the load of managing hundreds of replicas which all
> participate in RAFT protocol becomes more than regular transaction
> load. So making every replica a RAFT participant will affect the
> ability to deploy hundreds of replica.

I think this experience needs to be described in more detail.
Some implementations in the field are less efficient than others.

Early etcd-raft didn't have pre-voting and had a "bastardized"
(their own word) implementation of configuration changes
which didn't use joint consensus.

Then there is a liveness issue if leader election is implemented
in a straightforward way in large clusters. But this can be
addressed by scaling up the randomized election timeout with the
cluster size and by converting most participants to non-voters in
large clusters.
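
The two mitigations above can be sketched roughly as follows. This is
only an illustration: the function names, the constants (base_ms,
per_voter_ms, max_voters) and the linear scaling rule are assumptions
made for the example, not taken from this message, the Raft paper, or
any particular implementation.

```python
import random


def election_timeout_ms(voters, base_ms=150, per_voter_ms=10, rng=random):
    """Pick a randomized election timeout whose spread grows with the voter count.

    A wider interval makes it less likely that two nodes time out at nearly
    the same moment and split the vote. All constants are hypothetical.
    """
    if voters < 1:
        raise ValueError("need at least one voter")
    low = base_ms
    high = 2 * base_ms + per_voter_ms * voters
    return rng.uniform(low, high)


def split_roles(nodes, max_voters=5):
    """Keep a small voting core; every other node only replicates (non-voter)."""
    return nodes[:max_voters], nodes[max_voters:]


if __name__ == "__main__":
    voters, non_voters = split_roles([f"node{i}" for i in range(100)])
    print(len(voters), "voters,", len(non_voters), "non-voters")
    print("timeout for this node:", election_timeout_ms(len(voters)), "ms")
```

With a split like this, only the small voting core takes part in
elections, so the cost of an election does not grow with the total
number of replicas.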

Raft replication, again, if implemented in a naive way, would
require O(outstanding transactions * number of replicas) of
RAM. But the implementation doesn't have to be naive.
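
To make the memory argument concrete, here is a minimal sketch of the
difference. The message does not say which non-naive scheme is meant;
the shared log with one cursor per replica shown here is just one
common approach, and the class and method names are hypothetical.

```python
class NaiveLeader:
    """Queues every outstanding entry separately for each replica."""

    def __init__(self, replicas):
        self.queues = {r: [] for r in replicas}

    def append(self, entry):
        for queue in self.queues.values():
            queue.append(entry)

    def buffered_slots(self):
        # Grows as (outstanding entries) * (number of replicas).
        return sum(len(q) for q in self.queues.values())


class SharedLogLeader:
    """Keeps one shared log and only a position (cursor) per replica."""

    def __init__(self, replicas):
        self.log = []
        self.next_index = {r: 0 for r in replicas}

    def append(self, entry):
        self.log.append(entry)

    def pending_for(self, replica):
        # Entries this replica has not acknowledged yet.
        return self.log[self.next_index[replica]:]

    def ack(self, replica, count):
        # Advance the cursor, never past the end of the log.
        new_pos = self.next_index[replica] + count
        self.next_index[replica] = min(len(self.log), new_pos)

    def buffered_slots(self):
        # Grows as (outstanding entries) + (number of replicas).
        return len(self.log) + len(self.next_index)
```

In this sketch the naive leader's bookkeeping grows with the product of
outstanding entries and replicas, while the shared-log leader's grows
with their sum.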

To sum up, I am not aware of any fundamental limitations in this
area.

--
Konstantin Osipov, Moscow, Russia

In response to

Re: Built-in Raft replication at 2025-04-16 04:33:15 from Ashutosh Bapat
