RE: Slave stuck in recovery mode

From: "Nicolas Ross" <rossnick-lists(at)cybercat(dot)net>
To: <pgsql-admin(at)lists(dot)postgresql(dot)org>
Subject: RE: Slave stuck in recovery mode
Date: 2021-10-09 19:36:35
Message-ID: 011501d7bd44$f8b70a10$ea251e30$@cybercat.net
Lists: pgsql-admin

I ended up googling some more and found this:

https://www.enterprisedb.com/blog/be-sure-stop-your-backups

That is exactly what was happening. Even though I had no backup running, I ran the stop_backup command, etc.

I scheduled a restart of the master server and then re-cloned the slave; after that, everything was OK!

Strange.
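
For anyone hitting the same thing, here is a minimal sketch of how the stop_backup check mentioned above could be scripted against a 9.6 master. It assumes psql is on the PATH and that a superuser connection to the master works; MASTER_HOST, PGUSER_ADMIN and the function names are placeholders I made up, not anything from repmgr.

```shell
#!/bin/sh
# Sketch only: MASTER_HOST and PGUSER_ADMIN are hypothetical placeholders.
MASTER_HOST="${MASTER_HOST:-MASTERIP}"
PGUSER_ADMIN="${PGUSER_ADMIN:-postgres}"

# Exit status 0 if the master reports an exclusive backup in progress,
# 1 if it does not, 2 if the query itself failed.
master_in_backup() {
    in_backup=$(psql -h "$MASTER_HOST" -U "$PGUSER_ADMIN" -d postgres -At \
        -c "SELECT pg_is_in_backup();") || return 2
    [ "$in_backup" = "t" ]
}

# Ends an exclusive backup that was started with pg_start_backup().
stop_master_backup() {
    psql -h "$MASTER_HOST" -U "$PGUSER_ADMIN" -d postgres -At \
        -c "SELECT pg_stop_backup();"
}
```

One possible use is `master_in_backup && stop_master_backup`, run before cloning the slave again. Note that pg_is_in_backup() only exists in older releases such as 9.6; it was removed in later major versions.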

-----Original Message-----
From: Nicolas Ross <rossnick-lists(at)cybercat(dot)net>
Sent: October 8, 2021 19:16
To: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Slave stuck in recovery mode

Hi!

We’ve been using Postgres for some time now (since the
9.3 days).

I’ve got a pair of 9.6 servers: a primary and a slave.
We use repmgr to manage the cluster. When it was
installed, it was something like repmgr 4.x or even 3.

This week, for some reason, I had to rebuild the slave
instance, so I cloned it using a command like:

/usr/pgsql-9.6/bin/repmgr -h pgserver2.qualite -U repmgr \
    -f /etc/repmgr/9.6/repmgr.conf standby clone

After some time (it’s about 250 GB, so it takes an hour
or two), the command ends.

If I start the Postgres server on the slave with the OS
systemctl script, it doesn’t return to the CLI (presumably
it waits for something).

In the log I see:

< 2021-10-08 16:16:47.861 EDT > LOG: database system was shut down in recovery at 2021-10-08 16:04:10 EDT
< 2021-10-08 16:16:47.877 EDT > LOG: entering standby mode
< 2021-10-08 16:16:48.599 EDT > LOG: redo starts at 13BF/CF000028
< 2021-10-08 16:16:52.899 EDT > LOG: consistent recovery state reached at 13BF/D53BA0F0
(Some time passes)
< 2021-10-08 16:46:10.363 EDT > LOG: started streaming WAL from primary at 13C9/8C000000 on timeline 1

After that, if I try to connect to the slave, I get:

FATAL: the database system is starting up

The result is the same no matter how long I wait (I tried again more than a day later).
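
To watch for the moment the slave starts accepting connections without trying to log in each time, something along these lines could be used. It is only a sketch built on the pg_isready utility that ships with PostgreSQL; the function name standby_state is made up, and the host passed to it is whatever the slave is called.

```shell
#!/bin/sh
# Sketch only: standby_state is a hypothetical helper name.
# Maps pg_isready's documented exit status to a short description.
standby_state() {
    host="$1"
    rc=0
    pg_isready -h "$host" -q || rc=$?
    case "$rc" in
        0) echo "accepting connections" ;;
        1) echo "rejecting connections" ;;   # e.g. while the server is starting up
        2) echo "no response" ;;
        *) echo "no attempt made" ;;         # e.g. invalid parameters
    esac
}
```

Calling `standby_state` with the slave's host name in a loop with a sleep would show whether the server ever leaves the "rejecting connections" state, which is what the FATAL message above corresponds to.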

During that time, the master keeps streaming WAL to the
slave.
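
One way to confirm the streaming from the master's side is to look at pg_stat_replication. The snippet below is a rough sketch, assuming a superuser connection to the master; the function name and the PGUSER_ADMIN variable are placeholders. The column names are the 9.6 ones (they were renamed to sent_lsn and replay_lsn in PostgreSQL 10).

```shell
#!/bin/sh
# Sketch only: replication_status and PGUSER_ADMIN are hypothetical names.
# Prints one line per connected standby: name, state, sent and replayed WAL positions.
replication_status() {
    master_host="$1"
    psql -h "$master_host" -U "${PGUSER_ADMIN:-postgres}" -d postgres -At -F ' ' \
        -c "SELECT application_name, state, sent_location, replay_location
            FROM pg_stat_replication;"
}
```

Run as `replication_status MASTERIP`, this prints the state and WAL positions for each standby, so it is possible to see whether the slave is connected and whether its replay position moves over time.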

Notes:

The last log excerpt above was taken after cloning from
our barman server (I tried both with and without barman).

use_replication_slots is set to yes.

hot_standby is on on the primary, so it is also on on the
slave once cloned.

Before one of my clone attempts, I tried cleaning up all
traces of repmgr (removing the extension, re-registering
the master, etc.); still the same issue.

If I comment out hot_standby on the slave, it starts
normally, but still doesn’t allow connections.

recovery.conf is:

standby_mode = 'on'
primary_conninfo = 'host=MASTERIP user=repmgr application_name=SLAVENAME'
recovery_target_timeline = 'latest'
primary_slot_name = 'repmgr_slot_1'

Any help troubleshooting this would be appreciated!
