Re: Built-in Raft replication

From:	Konstantin Osipov <kostja(dot)osipov(at)gmail(dot)com>
To:	Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc:	Greg Sabino Mullane <htamfids(at)gmail(dot)com>, Nikolay Samokhvalov <nik(at)postgres(dot)ai>, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject:	Re: Built-in Raft replication
Date:	2025-04-16 09:47:00
Message-ID:	Z_98lANtXMJddscA@ark
Lists:	pgsql-hackers

* Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> [25/04/16 11:05]:
> Nikolay Samokhvalov <nik(at)postgres(dot)ai> writes:
> > This is exactly what I wanted to write as well. The idea is great. At the
> > same time, I think, consensus on many decisions will be extremely hard to
> > reach, so this project has a high risk of being very long. Unless it's an
> > extension, at least in the beginning.
>
> Yeah. The two questions you'd have to get past to get this into PG
> core are:
>
> 1. Why can't it be an extension? (You claimed it would work more
> seamlessly in core, but I don't think you've made a proven case.)

I think this can be best addressed when the discussion moves on to
an architecture design record, where the UX and implementation
details are outlined. I'm sure there can be a lot of bike-shedding
on that part. For now I merely wanted to know if:
- maybe there is a reason this will never be accepted
- maybe someone is already working on this.

From the replies I sense that while there is quite a bit of
scepticism about it ever making its way into the trunk, generally
there is no aversion to it. If my understanding is right,
it's a decent start.

> 2. Why depend on Raft rather than some other project?
>
> Longtime PG developers are going to be particularly hard on point 2,
> because we have a track record now of outliving outside projects
> that we thought we could rely on. One example here is the Snowball
> stemmer; while its upstream isn't quite dead, it's twitching only
> feebly, and seems to have a bus factor of 1. Another example is the
> Spencer regex engine; we thought we could depend on Tcl to be the
> upstream for that, but for a decade or more they've acted as though
> *we* are the upstream. And then there's libxml2. And uuid-ossp.
> And autoconf. And various documentation toolchains. Need I go on?
>
> The great advantage of implementing an outside dependency in an
> extension is that if the depended-on project dies, we can say a few
> words of mourning and move on. It's a lot harder to walk away from
> in-core features.

Raft is an algorithm, not a library. For a quick start the project
could use an existing library - I'd pick tikv's raft-rs, which
happens to be implemented in Rust, but going forward I'd guess the
community will want to have a plain C implementation.

There is a plethora of C implementations out there, but in my
somewhat educated opinion none are good enough for PostgreSQL
standards or purposes: ideally the protocol should be fully
isolated from storage and transport and extensively tested,
randomized & injection tests being a priority. Most of the C
implementations I've seen were built by enthusiasts as
self-education projects.
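
To illustrate what "fully isolated from storage and transport" could
look like, here is a minimal sketch in C of one protocol step (handling
a vote request) written as a function over in-memory state. All names
(RaftCore, VoteRequest, VoteReply, raft_handle_vote_request,
must_persist) are hypothetical and do not come from PostgreSQL or from
any existing Raft library. The sketch omits role changes, timers and
log handling; it only shows that the decision logic can be exercised by
randomized or fault-injection tests without any disk or network.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define RAFT_NO_NODE 0u          /* hypothetical "no vote cast" marker */

/* Hypothetical in-memory protocol state; no file or socket handles. */
typedef struct RaftCore {
    uint64_t current_term;
    uint32_t voted_for;          /* RAFT_NO_NODE if no vote in current_term */
    uint64_t last_log_index;
    uint64_t last_log_term;
} RaftCore;

typedef struct VoteRequest {
    uint64_t term;
    uint32_t candidate_id;
    uint64_t last_log_index;
    uint64_t last_log_term;
} VoteRequest;

typedef struct VoteReply {
    uint64_t term;
    bool     granted;
    bool     must_persist;       /* caller must make term/vote durable
                                  * before sending this reply */
} VoteReply;

/*
 * Pure protocol step: mutates only *core and returns the reply.
 * Persisting state and sending the reply are left to the caller.
 */
VoteReply
raft_handle_vote_request(RaftCore *core, const VoteRequest *req)
{
    VoteReply reply = {0};

    assert(req->candidate_id != RAFT_NO_NODE);

    /* A newer term resets our vote. */
    if (req->term > core->current_term) {
        core->current_term = req->term;
        core->voted_for = RAFT_NO_NODE;
        reply.must_persist = true;
    }

    /* Candidate's log must be at least as up-to-date as ours. */
    bool log_ok =
        req->last_log_term > core->last_log_term ||
        (req->last_log_term == core->last_log_term &&
         req->last_log_index >= core->last_log_index);

    if (req->term == core->current_term && log_ok &&
        (core->voted_for == RAFT_NO_NODE ||
         core->voted_for == req->candidate_id)) {
        if (core->voted_for != req->candidate_id) {
            core->voted_for = req->candidate_id;
            reply.must_persist = true;
        }
        reply.granted = true;
    }

    reply.term = core->current_term;
    return reply;
}
```

Because the function performs no I/O, a test harness can feed it
arbitrary message sequences, including duplicated, reordered or stale
ones, and check invariants such as "at most one vote per term".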

So at some point the project will need its own Raft
implementation. The good news is that the design of Raft internals
has been fairly well polished across the various
implementations in many different programming languages, so
it should be a fairly straightforward job.

Regarding maintenance: since it was first published back in ~2010,
the protocol has stabilized quite a bit. The core of the protocol
gets nearly no changes, and this is also noticeable in
implementations: e.g. etcd-raft, raft-rs from tikv, etc.
don't get many new commits nowadays.

A broader question is whether Raft is an optimal long-term
solution for log replication. Raft is leader-based, so in
theory it could be replaced with a leaderless
protocol - e.g. FastPaxos, EPaxos, and newer
developments on top of those. To the best of my understanding, all
leaderless algorithms which provide a single round-trip commit cost
require co-designing the transaction and replication
layers - which may be a far more intrusive change than adding Raft
on top of the existing synchronous replication in PostgreSQL.

Given that Raft already provides an amortized single-round-trip
commit time, and the goal is simplicity of UX and unification,
I'd say it's wise to wait for the leaderless approaches
to mature.
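
As an illustration of the single-round-trip commit: once the leader
has sent new entries and collected acknowledgements, it can advance
the commit point to the highest log index stored on a majority of
nodes. The sketch below, in C, computes that index. The names
raft_quorum_index and RAFT_MAX_NODES are hypothetical, and the sketch
leaves out Raft's additional rule that the leader only commits this
way for entries from its own current term.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RAFT_MAX_NODES 16        /* hypothetical cluster size limit */

/* qsort comparator: sorts uint64_t values in descending order. */
static int
cmp_u64_desc(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *) a;
    uint64_t y = *(const uint64_t *) b;

    return (x < y) - (x > y);
}

/*
 * match_index[i] is the highest log index known to be durable on node i,
 * with the leader's own log included as one of the n entries.
 * Returns the highest index that is present on a majority of the n nodes.
 */
uint64_t
raft_quorum_index(const uint64_t *match_index, size_t n)
{
    uint64_t sorted[RAFT_MAX_NODES];

    assert(n > 0 && n <= RAFT_MAX_NODES);
    memcpy(sorted, match_index, n * sizeof(sorted[0]));
    qsort(sorted, n, sizeof(sorted[0]), cmp_u64_desc);

    /*
     * In descending order, element n/2 is the (n/2 + 1)-th largest value,
     * so at least n/2 + 1 nodes (a majority) hold an index >= it.
     */
    return sorted[n / 2];
}
```

For example, with three nodes at indexes 10, 7 and 9, two of the three
hold index 9 or higher, so 9 is the quorum index.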

At the end of the day, there is always a trade-off between
doing something today and waiting for perfection, but in the case of
Raft, in my personal opinion, the balance is just right.

--
Konstantin Osipov, Moscow, Russia

In response to

Re: Built-in Raft replication at 2025-04-15 23:19:42 from Tom Lane
