Re: Manual failover cluster

From: Saul Perdomo <saul(dot)perdomo(at)gmail(dot)com>
To: Hispaniola Sol <moishap(at)hotmail(dot)com>
Cc: "pgsql-general(at)lists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Manual failover cluster
Date: 2021-08-23 17:37:43
Message-ID: CAN3jBgFXWwoKH+j=tsghVZrNEzuye-sHkAjY3uhN1ZVcgVPYgA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Hi Moishe,

Since we use pgbackrest ourselves, this is the process I followed to set up
something similar on PG 10:

https://pgstef.github.io/2018/11/28/combining_pgbackrest_and_streaming_replication.html

(Not knowing much [if at all] about the reason for your requirements, I
would recommend looking into enabling PITR backups in addition to the hot
standby[s], which as you've seen are not bulletproof.)

Anyway, just wanted to add that your experience roughly matches ours -- it
seems that an attempt to promote the standby isn't always smooth. My sample
size is almost certainly smaller than yours, about half a dozen attempts.
In the past I've attributed this to a poor understanding on my part of the
behind-the-scenes of the process, and chalked it up to having made a
mistake or other in the manual steps (mainly in the order they are
executed). That being said, if you find a way to faithfully reproduce the
issue, I'm sure the community will want to know, there is certainly an
expectation that the failover is reliable from the PG software side of
things, as long as there are no hardware nor operator issues! Again, not
knowing a lot about your setup, my first instinct would be to troubleshoot
your automated scripts, you might find that you need to change the order
things are run when on server B vs server A, for example..

On Fri, Aug 20, 2021 at 9:23 AM Hispaniola Sol <moishap(at)hotmail(dot)com> wrote:

> Team,
>
> I have a pg 10 cluster with a master and two hot-standby nodes. There is a
> requirement for a manual failover (nodes switching the roles) at will. This
> is a vanilla 3 node PG cluster that was built with WAL archiving (central
> location) and streaming replication to two hot standby nodes. The failover
> is scripted in Ansible. Ansible massages and moves around the
> archive/restore scripts, the conf files and the trigger and calls `
> pg_ctlcluster` to start/stop. This part _seems_ to be doing the job fine.
>
> The issue I am struggling with is the apparent fragility of the process -
> all 3 nodes will end up in a "good" state after the switch only every other
> time. Other times I have to rebase the hot-standby from the new master with
> pg_basebackup. It seems the issues are mostly with those nodes, ending up
> as slaves after the roles switch runs.
> They get errors like mismatch in timelines, recovering from the same WAL
> over and over again, invalid resource manager ID in primary checkpoint
> record, etc.
>
> In this light, I am wondering - using what's offered by PostgreSQL itself,
> i.e. streaming WAL replication with log shipping - can I expect to have
> this kind of failover 100% reliable on PG side ? Anyone is doing this
> reliably on PostgreSQL 10.1x ?
>
> Thanks !
>
> Moishe
>

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Saul Perdomo 2021-08-23 17:42:08 Re: Manual failover cluster
Previous Message Ninad Shah 2021-08-23 16:12:50 Re: Manual failover cluster