Re: Postgresql 9.6.12 switchover failing

From: M Jane <mskiesewetter(at)gmail(dot)com>
To: "Jehan-Guillaume (ioguix) de Rorthais" <ioguix(at)free(dot)fr>
Cc: Shreeyansh Dba <shreeyansh2014(at)gmail(dot)com>, Pgsql-admin <pgsql-admin(at)lists(dot)postgresql(dot)org>
Subject: Re: Postgresql 9.6.12 switchover failing
Date: 2019-03-11 11:17:59
Message-ID: CANBd27JKrozpm=hnV8hA+5mVhq1vM78D+i7A6Li9bZ-xiZ=ocA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Hi,
Thanks for your reply
I will try to put in all the logging I collected when doing the switchover.
I did the following:
1. Check if the standby is uptodate
I ran the following query on the standby:
SELECT pg_xlog_location_diff(pg_last_xlog_receive_location(),
pg_last_xlog_replay_location()) AS replication_delay_bytes;
This resulted in 0.
2. I shutdown the primary instance
and checked the alertlog of the primary:
< 2019-03-08 09:45:31.553 CET >LOG: database system was shut down at
2019-03-08 09:45:06 CET
< 2019-03-08 09:45:31.554 CET >FATAL: the database system is starting up
< 2019-03-08 09:45:31.557 CET >LOG: MultiXact member wraparound
protections are now enabled
< 2019-03-08 09:45:31.559 CET >LOG: autovacuum launcher started
< 2019-03-08 09:45:31.559 CET >LOG: database system is ready to accept
connections
< 2019-03-08 09:46:41.159 CET >LOG: received smart shutdown request
< 2019-03-08 09:46:41.160 CET >LOG: autovacuum launcher shutting down
< 2019-03-08 09:46:41.160 CET >FATAL: terminating connection due to
administrator command
< 2019-03-08 09:46:41.162 CET >LOG: shutting down
< 2019-03-08 09:46:41.182 CET >FATAL: the database system is shutting down
< 2019-03-08 09:46:41.332 CET >LOG: checkpoint starting: shutdown immediate
< 2019-03-08 09:46:41.345 CET >LOG: checkpoint complete: wrote 0 buffers
(0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.000
s, sync=0.000 s, total=0.015 s; sync files=0, longest=0.000 s,
average=0.000 s; distance=16384 kB, estimate=16384 kB
< 2019-03-08 09:46:41.357 CET >LOG: database system is shut down

3. Then i promoted the slave with pg_ctl promote and checked the alertlog:
< 2019-03-08 09:49:55.249 CET >LOG: selected new timeline ID: 4
< 2019-03-08 09:49:55.369 CET >LOG: archive recovery complete
< 2019-03-08 09:49:55.377 CET >LOG: restored log file "00000003.history"
from archive
< 2019-03-08 09:49:55.390 CET >LOG: MultiXact member wraparound
protections are now enabled
< 2019-03-08 09:49:55.391 CET >LOG: checkpoint starting: force
< 2019-03-08 09:49:55.393 CET >LOG: autovacuum launcher started
< 2019-03-08 09:49:55.393 CET >LOG: database system is ready to accept
connections
< 2019-03-08 09:49:55.402 CET >LOG: checkpoint complete: wrote 0 buffers
(0.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.000
s, sync=0.000 s, total=0.011 s; sync files=0, longest=0.000 s,
average=0.000 s; distance=0 kB, estimate=29490 kB

4. On the former primary I created a recovery.conf
cat recovery.conf
restore_command = 'cat archive_dir/%f | gunzip > %p'
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=newprimary port=5432 user=replication
password=xxxx'
trigger_file = '/tmp/instance.trigger.5432'

5. And started te instance on the former primary
alertlog former primary:
< 2019-03-08 09:53:04.543 CET >LOG: database system was shut down at
2019-03-08 09:46:41 CET
< 2019-03-08 09:53:04.566 CET >LOG: restored log file "00000004.history"
from archive
cat: archive_dir/00000005.history: No such file or directory

gzip: stdin: unexpected end of file
< 2019-03-08 09:53:04.571 CET >LOG: entering standby mode
< 2019-03-08 09:53:04.577 CET >LOG: restored log file "00000004.history"
from archive
cat: archive_dir/0000000400000002000000A8: No such file or directory

gzip: stdin: unexpected end of file
cat: archive_dir/0000000300000002000000A8: No such file or directory

gzip: stdin: unexpected end of file
< 2019-03-08 09:53:04.591 CET >FATAL: requested timeline 4 is not a child
of this server's history
< 2019-03-08 09:53:04.591 CET >DETAIL: Latest checkpoint is at 2/A8000028
on timeline 3, but in the history of the requested timeline, the server
forked off from that timeline at 2/A70000D0.
< 2019-03-08 09:53:04.592 CET >LOG: startup process (PID 29629) exited
with exit code 1
< 2019-03-08 09:53:04.592 CET >LOG: aborting startup due to startup
process failure
< 2019-03-08 09:53:04.595 CET >LOG: database system is shut down

Am I missing something?
Please advise?
Thanks in advance.
regards,
Marian

Op vr 8 mrt. 2019 om 17:51 schreef Jehan-Guillaume (ioguix) de Rorthais <
ioguix(at)free(dot)fr>:

> On Fri, 8 Mar 2019 22:10:15 +0530
> Shreeyansh Dba <shreeyansh2014(at)gmail(dot)com> wrote:
>
> > Hi Marian,
> >
> > Yes, We have faced the same issue, generally, this type of issue occurs
> due
> > to missing of history file i.e(00000000.history) during the replication
> > setup using the pg_basebackup.
> >
> > To resolve this type issue you need to create an empty history file with
> > name 00000000.history on the master server then rebuild old master by
> > initiating the pg_basebackup.
>
> Uh, what? Do you have further info about this? I fail to understand how
> this is
> related to OP question...
>
> > On Fri, Mar 8, 2019 at 8:11 PM M Jane <mskiesewetter(at)gmail(dot)com> wrote:
> ...
> > > But bringing up the old primary as a slave doesn't work. I get errors (
> > > like >FATAL: requested timeline 5 is not a child of this server's
> history)
> > > and the instances shuts down automatically.
>
> While doing a clean switchover, I never seen this (yet).
>
> How are you checking both servers are up-to-date? Do you check that the
> standby
> received everything from the old primary, including its shutdown
> checkpoint,
> before promoting?
>
> Make sure your standby staid connected to your old primary until the later
> is
> completely shutdown-ed. Nothing should disconnect them during the whole
> process. And make sure to use mode fast or smart to stop the old primary.
>
> Are you able to share with us the restart checkpoint on the old
> primary and the history file created by the new-primary? How far are the
> restart LSN and timeline fork from each other?
>
> Regards,
>

In response to

Responses

Browse pgsql-admin by date

  From Date Subject
Next Message Jehan-Guillaume (ioguix) de Rorthais 2019-03-11 14:55:36 Re: Postgresql 9.6.12 switchover failing
Previous Message wambacher 2019-03-10 10:57:58 Can't stop pgAdmin4 server process