Re: Built-in Raft replication

From:	Alastair Turner <minion(at)decodable(dot)me>
To:	Konstantin Osipov <kostja(dot)osipov(at)gmail(dot)com>
Cc:	Ashutosh Bapat <ashutosh(dot)bapat(dot)oss(at)gmail(dot)com>, Andrey Borodin <x4mmm(at)yandex-team(dot)ru>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Greg Sabino Mullane <htamfids(at)gmail(dot)com>, Nikolay Samokhvalov <nik(at)postgres(dot)ai>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject:	Re: Built-in Raft replication
Date:	2025-04-16 21:45:45
Message-ID:	CAC0Gmyy22XvRx5JkL0mvAMiBq4qVNa5+P1kDVcXzFUp-=ypk-Q@mail.gmail.com
Lists:	pgsql-hackers

Hi Konstantin

On Wed, 16 Apr 2025 at 15:07, Konstantin Osipov <kostja(dot)osipov(at)gmail(dot)com>
wrote:

> * Alastair Turner <minion(at)decodable(dot)me> [25/04/16 15:58]:
>
> > > > If you use built-in failover you have to resort to 3 big Postgres
> > > > machines because you need 2/3 majority. Of course, you can install
> > > > MySQL-style arbiter - host that had no real PGDATA, only participates in
> > > > voting. But this is a solution to problem induced by built-in
> > > > autofailover.
> > >
> > > Users find it a waste of resources to deploy 3 big PostgreSQL
> > > instances just for HA where 2 suffice even if they deploy 3
> > > lightweight DCS instances. Having only some of the nodes act as DCS
> > > and others purely PostgreSQL nodes will reduce waste of resources.
> > >
> > The experience of other projects/products with automated failover based on
> > quorum shows that this is a critical issue for adoption. In the In-memory
> > Data Grid space (Coherence, Geode/GemFire) the question of how to ensure
> > that some nodes didn't carry any data comes up early in many architecture
> > discussions. When RabbitMQ shipped their Quorum Queues feature, the first
> > and hardest area of pushback was around all nodes hosting message content.
> >
> > It's not just about the requirement for compute resources, it's also about
> > bandwidth and latency. Many large organisations have, for historical
> > reasons, pairs of data centres with very good point-to-point connectivity.
> > As the requirement for quorum witnesses has come up for all sorts of
> > things, including storage arrays, they have built arbiter/witness sites at
> > branches, colocation providers or even on the public cloud. More than not
> > holding user data or processing queries, the arbiter can't even be sent the
> > replication stream for the user data in the database, it just won't fit
> > down the pipe.
> >
> > Which feels like a very difficult requirement to meet if the replication
> > model for all data is being changed to a quorum model.
>
> I agree master/replica deployment layouts are very popular and are
> not going to directly benefit from raft. They'll still work, but no
> automation will be available, just like today with Patroni.
>

Users of Patroni and etcd setups can get automation for two-site
primary/replica pairs by putting a third etcd node on a third site. This
requires moving only the membership/leadership data to the arbiter site,
not all database activity.
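
To make the arithmetic concrete, here is a minimal sketch of why the third
voter matters. It is not Patroni or etcd code; the function name, the voter
names and the site names are all made up for illustration, and it models only
strict-majority voting, not leader election or replication.

```python
def has_quorum(voters, failed_sites):
    """Return True if the surviving voters form a strict majority.

    voters maps a (hypothetical) voter name to the site it runs on;
    failed_sites is the set of sites that are currently unreachable.
    """
    alive = [name for name, site in voters.items() if site not in failed_sites]
    return len(alive) > len(voters) // 2


# Two data centres only: losing either site loses the majority.
two_site = {"dcs1": "dc_a", "dcs2": "dc_b"}

# Same pair plus a lightweight witness that only votes and holds no user data.
with_witness = {"dcs1": "dc_a", "dcs2": "dc_b", "dcs3": "witness"}

print(has_quorum(two_site, {"dc_a"}))                 # False
print(has_quorum(with_witness, {"dc_a"}))             # True
print(has_quorum(with_witness, {"dc_a", "witness"}))  # False
```

The witness only has to take part in the vote, so the traffic to its site
is membership/leadership data rather than the replication stream.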

> However, if the storage cost is an argument, then the logical path is to
> disaggregate storage/compute altogether, i.e. use projects like
> neon.
>

The issue is not generally storage, but network. There may simply not be
enough bandwidth available to transmit the whole WAL to the arbiter site.

Many on-premises IT setups have this limitation in some form.

If your proposal would leave these large, traditional user organisations
(which account for thousands of Postgres HA pairs or DR pairs) doing what
they currently do with wrapper tooling like Patroni, and create a new,
in-core option for balanced 3, 5, 7... member groups, then I don't think
it's worth doing.

Regards,
Alastair

In response to

Re: Built-in Raft replication at 2025-04-16 14:07:36 from Konstantin Osipov
