Re: Postgresql 9.6.12 switchover failing

From: M Jane <mskiesewetter(at)gmail(dot)com>
To: "Jehan-Guillaume (ioguix) de Rorthais" <ioguix(at)free(dot)fr>
Cc: Shreeyansh Dba <shreeyansh2014(at)gmail(dot)com>, Pgsql-admin <pgsql-admin(at)lists(dot)postgresql(dot)org>
Subject: Re: Postgresql 9.6.12 switchover failing
Date: 2019-03-12 15:30:46
Message-ID: CANBd27++C1vV=p+=3JO9g8yaLaJErHCaLTEPHovVNpj24wAiFQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Hi Jehan-Guillaume,
Thanks for the support.
issue is solved thanks to your help.
The culprit was in the service script which did not stop the instance
correct, resulting in a standby that did not receive all xlogs.
Fixed the service script and I am able to switchover without any problem.

Regards,
Marian

Op ma 11 mrt. 2019 om 17:44 schreef M Jane <mskiesewetter(at)gmail(dot)com>:

> Hi,
> I will do the test tomorrow morning with your suggestions and will capture
> the logs and steps.
> Thanks for the advice.
> regards,
> Marian
>
>
> Op ma 11 mrt. 2019 om 15:55 schreef Jehan-Guillaume (ioguix) de Rorthais <
> ioguix(at)free(dot)fr>:
>
>> On Mon, 11 Mar 2019 12:17:59 +0100
>> M Jane <mskiesewetter(at)gmail(dot)com> wrote:
>>
>> > Hi,
>> > Thanks for your reply
>> > I will try to put in all the logging I collected when doing the
>> switchover.
>> > I did the following:
>> > 1. Check if the standby is uptodate
>> > I ran the following query on the standby:
>> > SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(),
>> > pg_last_xlog_replay_location()) AS replication_delay_bytes;
>> > This resulted in 0.
>>
>> This is not required. You don't need to check if some received XLOG are
>> not
>> replayed on data yet. As far as the XLOG is on disk on the standby side,
>> it will
>> be replayed before promoting by default. Quoting doc:
>>
>> pg_last_xlog_receive_location():
>> Get last transaction log location received and synced to disk by
>> streaming
>> replication.
>>
>> In this step, you should check the standby is connected to the primary.
>> Either
>> be listing your processes on both side, check pg_stat_replication, etc.
>>
>> > 2. I shutdown the primary instance
>> > and checked the alertlog of the primary:
>> [...]
>> > < 2019-03-08 09:46:41.159 CET >LOG: received smart shutdown request
>> > < 2019-03-08 09:46:41.160 CET >LOG: autovacuum launcher shutting down
>> > < 2019-03-08 09:46:41.160 CET >FATAL: terminating connection due to
>> > administrator command
>>
>> What is this FATAL? Did you issue some more administrative commands after
>> the
>> initial "pg_ctl -m smart stop"?
>>
>> I would recommend using "pg_ctl -m fast stop" and nothing else.
>>
>> > 3. Then i promoted the slave with pg_ctl promote and checked the
>> alertlog:
>>
>> What was the last received LSN? What was the last WAL file the standby
>> retrieved from archives?
>>
>> Do you have log lines before "LOG: selected new timeline ID: 4" ?
>>
>> [...]
>> > 4. On the former primary I created a recovery.conf
>> > cat recovery.conf
>> > restore_command = 'cat archive_dir/%f | gunzip > %p'
>>
>> "cat" is useless here. I recommend: "gzip -dc archive_dir/%f > %p"
>>
>> > recovery_target_timeline = 'latest'
>> > standby_mode = on
>> > primary_conninfo = 'host=newprimary port=5432 user=replication
>> password=xxxx'
>> > trigger_file = '/tmp/instance.trigger.5432'
>>
>> You should not put the trigger file in a public accessible folder. Anyone
>> can
>> triggers the promotion...
>> You should only rely on "pg_ctl promote" if possible.
>>
>> > 5. And started te instance on the former primary
>> > alertlog former primary:
>> [...]
>> > < 2019-03-08 09:53:04.591 CET >FATAL: requested timeline 4 is not a
>> child
>> > of this server's history
>> > < 2019-03-08 09:53:04.591 CET >DETAIL: Latest checkpoint is at
>> 2/A8000028
>> > on timeline 3, but in the history of the requested timeline, the server
>> > forked off from that timeline at 2/A70000D0.
>>
>> According to these log lines:
>>
>> * during shutdown, old primary archived 0000000300000002000000A7 and
>> wrote its
>> shutdown checkpoint as first xlog record in 0000000300000002000000A8.
>> The
>> following command should show it:
>>
>> pg_xlogdump "pg_xlog/0000000300000002000000A8"
>>
>> * for some reasons, the standby never received any content from
>> 0000000300000002000000A8. Neither be streaming rep, nor restored from
>> archives
>>
>> * the promoted standby forked its timeline from the last received LSN
>> that was
>> in early 0000000300000002000000A7 file (2/A70000D0)
>>
>> * you end up with a new master that forked (probably) right before the
>> shutdown
>> checkpoint of the old master (as far as the next record in 2/A70000D0
>> was an
>> XLOG SWITCH).
>>
>> In conclusion, the old master is right: the fork happen before its own
>> last
>> writes.
>>
>> > Am I missing something?
>>
>> You should really take care of the first steps. Why the standby did not
>> received the shutdown checkpoint? Was it connected when you issued the
>> "pg_ctl
>> stop"? Did the wal_sender processus been killed? A network issue? Why the
>> standby has not been able to restore the last segment from archives before
>> promoting?
>>
>> I usually compare the last checkpoint LSN in the controldata from the old
>> master with what received the standby to promote, before promoting it.
>>
>> > Please advise?
>> > Thanks in advance.
>> > regards,
>> > Marian
>> >
>> > Op vr 8 mrt. 2019 om 17:51 schreef Jehan-Guillaume (ioguix) de Rorthais
>> <
>> > ioguix(at)free(dot)fr>:
>> >
>> > > On Fri, 8 Mar 2019 22:10:15 +0530
>> > > Shreeyansh Dba <shreeyansh2014(at)gmail(dot)com> wrote:
>> > >
>> > > > Hi Marian,
>> > > >
>> > > > Yes, We have faced the same issue, generally, this type of issue
>> occurs
>> > > due
>> > > > to missing of history file i.e(00000000.history) during the
>> replication
>> > > > setup using the pg_basebackup.
>> > > >
>> > > > To resolve this type issue you need to create an empty history file
>> with
>> > > > name 00000000.history on the master server then rebuild old master
>> by
>> > > > initiating the pg_basebackup.
>> > >
>> > > Uh, what? Do you have further info about this? I fail to understand
>> how
>> > > this is
>> > > related to OP question...
>> > >
>> > > > On Fri, Mar 8, 2019 at 8:11 PM M Jane <mskiesewetter(at)gmail(dot)com>
>> wrote:
>> > > ...
>> > > > > But bringing up the old primary as a slave doesn't work. I get
>> errors (
>> > > > > like >FATAL: requested timeline 5 is not a child of this server's
>> > > history)
>> > > > > and the instances shuts down automatically.
>> > >
>> > > While doing a clean switchover, I never seen this (yet).
>> > >
>> > > How are you checking both servers are up-to-date? Do you check that
>> the
>> > > standby
>> > > received everything from the old primary, including its shutdown
>> > > checkpoint,
>> > > before promoting?
>> > >
>> > > Make sure your standby staid connected to your old primary until the
>> later
>> > > is
>> > > completely shutdown-ed. Nothing should disconnect them during the
>> whole
>> > > process. And make sure to use mode fast or smart to stop the old
>> primary.
>> > >
>> > > Are you able to share with us the restart checkpoint on the old
>> > > primary and the history file created by the new-primary? How far are
>> the
>> > > restart LSN and timeline fork from each other?
>> > >
>> > > Regards,
>> > >
>>
>>
>>
>> --
>> Jehan-Guillaume de Rorthais
>> Dalibo
>>
>

In response to

Browse pgsql-admin by date

  From Date Subject
Next Message Pierre Ochsenbein 2019-03-12 18:00:45 pgsql auto
Previous Message M Jane 2019-03-11 16:44:52 Re: Postgresql 9.6.12 switchover failing