Re: Postgresql 9.6.12 switchover failing

From: M Jane <mskiesewetter(at)gmail(dot)com>
To: "Jehan-Guillaume (ioguix) de Rorthais" <ioguix(at)free(dot)fr>
Cc: Shreeyansh Dba <shreeyansh2014(at)gmail(dot)com>, Pgsql-admin <pgsql-admin(at)lists(dot)postgresql(dot)org>
Subject: Re: Postgresql 9.6.12 switchover failing
Date: 2019-03-11 16:44:52
Message-ID: CANBd27+4FdHYxRT8eCFWGoV_NymqkpOKKX=jpUpbtBwHvv0XjA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Hi,
I will do the test tomorrow morning with your suggestions and will capture
the logs and steps.
Thanks for the advice.
regards,
Marian

Op ma 11 mrt. 2019 om 15:55 schreef Jehan-Guillaume (ioguix) de Rorthais <
ioguix(at)free(dot)fr>:

> On Mon, 11 Mar 2019 12:17:59 +0100
> M Jane <mskiesewetter(at)gmail(dot)com> wrote:
>
> > Hi,
> > Thanks for your reply
> > I will try to put in all the logging I collected when doing the
> switchover.
> > I did the following:
> > 1. Check if the standby is uptodate
> > I ran the following query on the standby:
> > SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(),
> > pg_last_xlog_replay_location()) AS replication_delay_bytes;
> > This resulted in 0.
>
> This is not required. You don't need to check if some received XLOG are not
> replayed on data yet. As far as the XLOG is on disk on the standby side,
> it will
> be replayed before promoting by default. Quoting doc:
>
> pg_last_xlog_receive_location():
> Get last transaction log location received and synced to disk by
> streaming
> replication.
>
> In this step, you should check the standby is connected to the primary.
> Either
> be listing your processes on both side, check pg_stat_replication, etc.
>
> > 2. I shutdown the primary instance
> > and checked the alertlog of the primary:
> [...]
> > < 2019-03-08 09:46:41.159 CET >LOG: received smart shutdown request
> > < 2019-03-08 09:46:41.160 CET >LOG: autovacuum launcher shutting down
> > < 2019-03-08 09:46:41.160 CET >FATAL: terminating connection due to
> > administrator command
>
> What is this FATAL? Did you issue some more administrative commands after
> the
> initial "pg_ctl -m smart stop"?
>
> I would recommend using "pg_ctl -m fast stop" and nothing else.
>
> > 3. Then i promoted the slave with pg_ctl promote and checked the
> alertlog:
>
> What was the last received LSN? What was the last WAL file the standby
> retrieved from archives?
>
> Do you have log lines before "LOG: selected new timeline ID: 4" ?
>
> [...]
> > 4. On the former primary I created a recovery.conf
> > cat recovery.conf
> > restore_command = 'cat archive_dir/%f | gunzip > %p'
>
> "cat" is useless here. I recommend: "gzip -dc archive_dir/%f > %p"
>
> > recovery_target_timeline = 'latest'
> > standby_mode = on
> > primary_conninfo = 'host=newprimary port=5432 user=replication
> password=xxxx'
> > trigger_file = '/tmp/instance.trigger.5432'
>
> You should not put the trigger file in a public accessible folder. Anyone
> can
> triggers the promotion...
> You should only rely on "pg_ctl promote" if possible.
>
> > 5. And started te instance on the former primary
> > alertlog former primary:
> [...]
> > < 2019-03-08 09:53:04.591 CET >FATAL: requested timeline 4 is not a
> child
> > of this server's history
> > < 2019-03-08 09:53:04.591 CET >DETAIL: Latest checkpoint is at
> 2/A8000028
> > on timeline 3, but in the history of the requested timeline, the server
> > forked off from that timeline at 2/A70000D0.
>
> According to these log lines:
>
> * during shutdown, old primary archived 0000000300000002000000A7 and wrote
> its
> shutdown checkpoint as first xlog record in 0000000300000002000000A8. The
> following command should show it:
>
> pg_xlogdump "pg_xlog/0000000300000002000000A8"
>
> * for some reasons, the standby never received any content from
> 0000000300000002000000A8. Neither be streaming rep, nor restored from
> archives
>
> * the promoted standby forked its timeline from the last received LSN that
> was
> in early 0000000300000002000000A7 file (2/A70000D0)
>
> * you end up with a new master that forked (probably) right before the
> shutdown
> checkpoint of the old master (as far as the next record in 2/A70000D0
> was an
> XLOG SWITCH).
>
> In conclusion, the old master is right: the fork happen before its own last
> writes.
>
> > Am I missing something?
>
> You should really take care of the first steps. Why the standby did not
> received the shutdown checkpoint? Was it connected when you issued the
> "pg_ctl
> stop"? Did the wal_sender processus been killed? A network issue? Why the
> standby has not been able to restore the last segment from archives before
> promoting?
>
> I usually compare the last checkpoint LSN in the controldata from the old
> master with what received the standby to promote, before promoting it.
>
> > Please advise?
> > Thanks in advance.
> > regards,
> > Marian
> >
> > Op vr 8 mrt. 2019 om 17:51 schreef Jehan-Guillaume (ioguix) de Rorthais <
> > ioguix(at)free(dot)fr>:
> >
> > > On Fri, 8 Mar 2019 22:10:15 +0530
> > > Shreeyansh Dba <shreeyansh2014(at)gmail(dot)com> wrote:
> > >
> > > > Hi Marian,
> > > >
> > > > Yes, We have faced the same issue, generally, this type of issue
> occurs
> > > due
> > > > to missing of history file i.e(00000000.history) during the
> replication
> > > > setup using the pg_basebackup.
> > > >
> > > > To resolve this type issue you need to create an empty history file
> with
> > > > name 00000000.history on the master server then rebuild old master by
> > > > initiating the pg_basebackup.
> > >
> > > Uh, what? Do you have further info about this? I fail to understand how
> > > this is
> > > related to OP question...
> > >
> > > > On Fri, Mar 8, 2019 at 8:11 PM M Jane <mskiesewetter(at)gmail(dot)com>
> wrote:
> > > ...
> > > > > But bringing up the old primary as a slave doesn't work. I get
> errors (
> > > > > like >FATAL: requested timeline 5 is not a child of this server's
> > > history)
> > > > > and the instances shuts down automatically.
> > >
> > > While doing a clean switchover, I never seen this (yet).
> > >
> > > How are you checking both servers are up-to-date? Do you check that the
> > > standby
> > > received everything from the old primary, including its shutdown
> > > checkpoint,
> > > before promoting?
> > >
> > > Make sure your standby staid connected to your old primary until the
> later
> > > is
> > > completely shutdown-ed. Nothing should disconnect them during the whole
> > > process. And make sure to use mode fast or smart to stop the old
> primary.
> > >
> > > Are you able to share with us the restart checkpoint on the old
> > > primary and the history file created by the new-primary? How far are
> the
> > > restart LSN and timeline fork from each other?
> > >
> > > Regards,
> > >
>
>
>
> --
> Jehan-Guillaume de Rorthais
> Dalibo
>

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message M Jane 2019-03-12 15:30:46 Re: Postgresql 9.6.12 switchover failing
Previous Message Jehan-Guillaume (ioguix) de Rorthais 2019-03-11 14:55:36 Re: Postgresql 9.6.12 switchover failing