Re: Failback to old master

From: Robert Haas <robertmhaas(at)gmail(dot)com>
To: "Maeldron T(dot)" <maeldron(at)gmail(dot)com>
Cc: "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Failback to old master
Date: 2014-10-29 14:41:24
Message-ID: CA+TgmoYDRgOBKY5L4rnpJTNfdE8YJf3dLf76o9t7h5qv=U=TJw@mail.gmail.com
Lists: pgsql-hackers

On Wed, Oct 29, 2014 at 6:21 AM, Maeldron T. <maeldron(at)gmail(dot)com> wrote:
> I swear I have read a couple of old threads. Yet I am not sure if it is safe to
> failback to the old master in case of async replication without base backup.
>
> Considering:
> I have the latest 9.3 server
> A: master
> B: slave
> B is actively connected to A
>
> I shut down A manually with -m fast (it's the default FreeBSD init script
> setting)
> I remove the recovery.conf from B
> I restart B
> I create a recovery.conf on A
> I start A
> I see nothing wrong in the logs
> I go for a lunch
> I shut down B
> I remove the recovery.conf on A
> I restart A
> I restore the recovery.conf on B
> I start B
> I see nothing wrong in the logs and I see that replication is working
>
> Can I say that my data is safe in this case?
>
> If the answer is yes, is it safe to do this if there was a power outage on A
> instead of manual shutdown? Considering that the log says nothing wrong. (Of
> course if it complains I'd do base backup from B).

The threshold question here is whether the original master might have
written (and thus, perhaps, applied) write-ahead log records that were
not replayed on the slave. If A crashed, that is definitely possible,
so this is definitely not safe. If A was shut down cleanly, then
streaming replication *should* take everything up through the shutdown
checkpoint and replicate those to the standby, which *should* replay
them. If all goes according to plan, I think this will work.
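As a rough illustration of the first half of that question (did A shut down cleanly?), the sketch below reads the "Database cluster state" line from pg_controldata output captured on the old master. The helper names are hypothetical, and the output is assumed to have already been captured as text. A state of "shut down" only shows that a shutdown checkpoint was written on A; it says nothing about whether the standby received or replayed it.

```python
def parse_controldata(text):
    """Turn pg_controldata-style 'Key: value' lines into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing colons survive.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def was_cleanly_shut_down(text):
    """True if the captured output reports a clean shutdown of a master."""
    return parse_controldata(text).get("Database cluster state") == "shut down"
```

If this returns False for A (for example after a power outage, where the state would typically still read "in production"), the crash case applies and a fresh base backup is the safe route.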

I'm not sure we really have enough safeties to make this robust,
though: for example, at the point when the shutdown checkpoint is
written, I believe that the master is no longer accepting new
connections - so if the connection to the slave is broken before the
shutdown checkpoint record is replicated, then it's not safe any more,
but how will we detect that? And, if you remove recovery.conf on the
slave, it will abort replay and enter normal running as soon as it
reaches what it thinks is end-of-WAL, with no cross-check to make sure
that's really the same point that the master was actually at. So
it strikes me that it might be quite difficult to really have
confidence that nothing will go wrong.
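To make the missing cross-check concrete, here is a minimal sketch of a manual comparison someone could do before promoting B. It assumes you have noted the "Latest checkpoint location" from pg_controldata on the cleanly stopped A, and the result of pg_last_xlog_replay_location() on B, both as text in the usual hex "hi/lo" form. The function names are made up for illustration, and this is not something the server does for you.

```python
def parse_lsn(lsn):
    """Convert a WAL location such as '0/3000028' into a single integer."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def standby_replayed_past(master_checkpoint_lsn, standby_replay_lsn):
    """True if the standby's replay location is beyond the start of the
    master's shutdown checkpoint record, i.e. that record was replayed."""
    return parse_lsn(standby_replay_lsn) > parse_lsn(master_checkpoint_lsn)
```

The comparison is strict because the checkpoint location is where the record starts, while the replay location is the end of the last record replayed. Passing this check is a necessary condition at best; I would not treat it as proof that failback without a base backup is safe.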

I'm definitely not the expert in this area on this mailing list, so
I'm curious what others think.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
