Re: HA Setup Review

From: akshay polji <akshay(dot)polji(at)gmail(dot)com>
To: Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
Cc: pgsql-admin <pgsql-admin(at)postgresql(dot)org>
Subject: Re: HA Setup Review
Date: 2024-04-30 16:29:22
Message-ID: CAHecRem5iCXo7LwhY8RqxKyH-TQa4Jyr8bVUUgiiSfb4uL=tuA@mail.gmail.com
Lists: pgsql-admin

Thanks a lot Ron and Scott for sharing your insights.

---- Point - 1
"Watchdog and heartbeat are built into PgPool. Is that what you're using
for WD and HB?" --> Yes.

I completely agree with you that with synchronous replication across data
centers, any network glitch would freeze the primary database.
However, in a cloud context (e.g., Azure), we are planning to place the 3
nodes of the cluster in different Availability Zones, but still within the
same region.

As per the Azure documentation
(https://learn.microsoft.com/en-us/azure/reliability/availability-zones-overview?tabs=azure-cli):
"Availability zones are close enough to have low-latency connections to
other availability zones. They're connected by a high-performance network
with a round-trip latency of less than 2ms."

So do you think that even in such a cluster, with 3 nodes each running
PgPool + PostgreSQL on the same machine and synchronous replication (ANY 1
of the two replicas), the primary DB would be at risk of degraded
performance?
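
To put rough numbers on that question, here is a minimal back-of-the-envelope
sketch in Python. It assumes a deliberately simple model that is not from this
thread: a synchronous commit costs roughly the local WAL flush, plus one
network round trip to the acknowledging standby, plus that standby's WAL
flush. The function names and the 1 ms flush figures are hypothetical; only
the 2 ms round-trip bound comes from the Azure documentation quoted above.

```python
def quorum_wait_ms(standby_round_trips_ms, quorum=1):
    """Round trip the primary waits on when ANY `quorum` standbys must acknowledge.

    The commit can proceed once the `quorum`-th fastest standby has replied.
    """
    if not 1 <= quorum <= len(standby_round_trips_ms):
        raise ValueError("quorum must be between 1 and the number of standbys")
    return sorted(standby_round_trips_ms)[quorum - 1]


def max_serial_commits_per_second(local_flush_ms, round_trip_ms, remote_flush_ms):
    """Upper bound on back-to-back commits a single session can issue (simplified model)."""
    commit_ms = local_flush_ms + round_trip_ms + remote_flush_ms
    return 1000.0 / commit_ms


# Hypothetical figures: two standbys at 2.0 ms and 1.5 ms round trip,
# and an assumed 1 ms WAL flush on each side.
wait = quorum_wait_ms([2.0, 1.5], quorum=1)
rate = max_serial_commits_per_second(1.0, wait, 1.0)
print(f"waits on a {wait} ms round trip, at most {rate:.0f} serial commits/s per session")
```

Under this model the bound is per session; sessions that commit concurrently
wait in parallel, so the effect shows up mainly as added latency per commit
rather than as a hard cap on total throughput.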

---- Point - 2
For DR, we would add another standby in a different region, but that one
would use asynchronous replication.

---- Point - 3
"You can switch from async to sync replication just before patching, and
then switch back to async when it's completed." --> I am a little confused
here. What benefit do we get by switching from async to sync replication
before patching? I mean, that would block the transactions on the primary DB,
right? What am I missing?
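
For reference, the mode being discussed is controlled by
`synchronous_standby_names` on the primary: an empty value means asynchronous
replication, and `ANY 1 (...)` means a commit waits for any one of the listed
standbys. Below is a minimal Python sketch that only builds the statements and
does not connect to anything. The standby names `node2` and `node3` are
hypothetical (they stand for the standbys' `application_name` values) and are
assumed to need no quoting.

```python
def synchronous_standby_names(standbys, quorum=1):
    """Render a quorum-based value for synchronous_standby_names.

    An empty list yields '', which disables synchronous replication.
    """
    if not standbys:
        return ""
    if not 1 <= quorum <= len(standbys):
        raise ValueError("quorum must be between 1 and the number of standbys")
    return f"ANY {quorum} ({', '.join(standbys)})"


def switch_statements(value):
    """SQL to persist the setting and reload the configuration on the primary."""
    return [
        f"ALTER SYSTEM SET synchronous_standby_names = '{value}';",
        "SELECT pg_reload_conf();",
    ]


# Hypothetical standby names; replace with the real application_name values.
to_sync = switch_statements(synchronous_standby_names(["node2", "node3"], quorum=1))
to_async = switch_statements(synchronous_standby_names([]))

for statement in to_sync + to_async:
    print(statement)
```

The setting takes effect on a configuration reload, so no restart of the
primary is needed to switch in either direction.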

---- Point - 4
" There will always be *some seconds of lag* while the secondary-that-was
is promoted to new-primary, and the applications that were
forcibly disconnected from the old primary are connected to the
new-primary. "
Agreed, 100%. But that's where the application needs retry logic to handle
transient failures and avoid direct impact.
So, to Deepak's question: IMO true *ZERO downtime* has to be solved by the
App and DB teams together; it's not really the DB's problem to solve
independently.
"Easier said than done" :D.

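As an illustration of the retry logic mentioned under Point 4, here is a
minimal Python sketch using exponential backoff. Everything in it is an
assumption for illustration: `TransientDbError` is a hypothetical exception
that the application would map its driver's connection-lost errors to, and
the attempt count and delays are arbitrary.

```python
import time


class TransientDbError(Exception):
    """Hypothetical marker for errors worth retrying, e.g. a connection lost during failover."""


def run_with_retry(operation, attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are exhausted.

    Waits base_delay, 2*base_delay, 4*base_delay, ... (capped at max_delay)
    between attempts, and re-raises the last error once attempts run out.
    """
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientDbError:
            if attempt == attempts:
                raise
            sleep(min(max_delay, base_delay * 2 ** (attempt - 1)))


def make_flaky_operation(failures):
    """Demo helper: fails `failures` times, then returns 'committed'."""
    state = {"remaining": failures}

    def operation():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise TransientDbError("connection lost")
        return "committed"

    return operation


# Simulate a failover that breaks the first two attempts (no real sleeping here).
result = run_with_retry(make_flaky_operation(2), sleep=lambda seconds: None)
print(result)
```

Retrying like this is only safe when the operation is a whole transaction
that can be re-run from the start, or is otherwise idempotent; a commit whose
acknowledgement was lost during failover may or may not have been applied.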
Thanks,
Akshay.

On Tue, Apr 30, 2024 at 7:51 PM Ron Johnson <ronljohnsonjr(at)gmail(dot)com> wrote:

> You're confusing HA with DR.
>
> A 3-node cluster, with two in the primary DC and the third (asynchronously
> replicated) in the remote DC will give you both.
>
> ZERO downtime is -- to my knowledge -- impossible with master-slave
> replication. There will always be *some seconds of lag* while the
> secondary-that-was is promoted to new-primary, and the applications that
> were forcibly disconnected from the old primary are connected to the
> new-primary.
>
> Heck, even in a master-master DB cluster, any connections on the master
> that dies will be down until they can connect to the other master.
>
> On Tue, Apr 30, 2024 at 8:58 AM Deepak Pahuja . <deepakpahuja(at)hotmail(dot)com>
> wrote:
>
>> Hi Ron,
>>
>> Thanks for the details.
>>
>> Kindly share how we can achieve HA in postgresql, basically my
>> requirement is zero downtime for the application and the database.
>>
>> In this scenario we have to do failover and in that time there will be
>> outage, kindly correct me if I am wrong.
>>
>>
>> Also share how can we achieve zero downtime of database (primary write
>> available always) in PG.
>>
>> Thanks Deepak
>>
>> Sent from Outlook for Android <https://aka.ms/AAb9ysg>
>> ------------------------------
>> *From:* Ron Johnson <ronljohnsonjr(at)gmail(dot)com>
>> *Sent:* Tuesday, April 30, 2024 8:22:36 PM
>> *To:* pgsql-admin <pgsql-admin(at)postgresql(dot)org>
>> *Subject:* Re: HA Setup Review
>>
>> On Tue, Apr 30, 2024 at 3:41 AM akshay polji <akshay(dot)polji(at)gmail(dot)com>
>> wrote:
>>
>> Hello Team,
>>
>> I am looking for some feedback on the HA Setup that we are finalizing for
>> running our business critical workloads.
>>
>> We are planning to follow this Setup,
>> https://www.pgpool.net/docs/42/en/html/example-cluster.html
>>
>>
>> - Basically a 3 node PostgreSQL Cluster, running 3 processes i.e.
>> PostgreSQL DB, PGPool and WatchDog.
>> - These 3 nodes will be distributed across 3 availability zones/data
>> centers for resilience and use a synchronous replication between
>> Primary and Stand-by.
>>
>> You're describing HA+DR, not just HA,
>>
>> Also, I wouldn't do synchronous replication across the WAN. Not only is
>> the latency too high for decent performance, but any fault in the network
>> freezes the DB.
>>
>>
>> - Synchronous option will be Any One, so that the DB availability is
>> not impacted if 1 Stand-by is down for even planned outage i.e. Patching of
>> DB or Virtual Machine.
>>
>> You can switch from async to sync replication just before patching, and
>> then switch back to async when it's completed.
>>
>> That's pretty much what we do for HA, except only two DB instances (but
>> still three PgPool instances), and they are local and asynchronously
>> replicated. DR is handled by VMware SRM.
>>
>> Watchdog and heartbeat are built into PgPool. Is that what you're using
>> for WD and HB?
>>
>>
