From: | "Nicolas Ross" <rossnick-lists(at)cybercat(dot)net> |
---|---|
To: | <pgsql-admin(at)lists(dot)postgresql(dot)org> |
Subject: | RE: Slave stuck in recovery mode |
Date: | 2021-10-09 19:36:35 |
Message-ID: | 011501d7bd44$f8b70a10$ea251e30$@cybercat.net |
Lists: | pgsql-admin |
After some more searching, I found this:
https://www.enterprisedb.com/blog/be-sure-stop-your-backups
That is exactly what was happening. Even though I had no backup running, I ran the stop_backup command, etc.
I scheduled a restart of the master server and then re-cloned; after that, everything was OK!
Strange.
-----Original Message-----
From: Nicolas Ross <rossnick-lists(at)cybercat(dot)net>
Sent: October 8, 2021 19:16
To: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Slave stuck in recovery mode
Hi!
We’ve been using postgres for some time now (since the
9.3 days).
I’ve got a 9.6 cluster with 2 nodes, a primary and a
slave. We use repmgr to manage the cluster. When it was
installed, it was something like repmgr 4.x or even 3.
This week, for some reason, I had to rebuild the slave
instance. So I cloned the slave using a command like:
/usr/pgsql-9.6/bin/repmgr -h pgserver2.qualite -U repmgr -f /etc/repmgr/9.6/repmgr.conf standby clone
After some time (it’s about 250 GB, so it takes an hour
or two), the command ends.
If I start the postgres server on the slave with the OS
systemctl script, it doesn’t return to the CLI (presumably
it is waiting for something).
In the log I see:
< 2021-10-08 16:16:47.861 EDT > LOG: database system was shut down in recovery at 2021-10-08 16:04:10 EDT
< 2021-10-08 16:16:47.877 EDT > LOG: entering standby mode
< 2021-10-08 16:16:48.599 EDT > LOG: redo starts at 13BF/CF000028
< 2021-10-08 16:16:52.899 EDT > LOG: consistent recovery state reached at 13BF/D53BA0F0
(Some time passes)
< 2021-10-08 16:46:10.363 EDT > LOG: started streaming WAL from primary at 13C9/8C000000 on timeline 1
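To get a feel for how far apart the WAL positions in a log like this are, the LSNs can be converted to byte offsets. The following is only a minimal sketch: the helper names `parse_lsn` and `wal_distance_bytes` are hypothetical, and it relies solely on the fact that an LSN is printed as the upper and lower 32 bits of a 64-bit position, in hexadecimal.

```python
def parse_lsn(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '13BF/CF000028' to a byte position."""
    high, low = lsn.strip().split("/")
    # The text form is two hex numbers: upper 32 bits, then lower 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def wal_distance_bytes(start_lsn: str, end_lsn: str) -> int:
    """Number of WAL bytes between two positions (negative if end < start)."""
    return parse_lsn(end_lsn) - parse_lsn(start_lsn)


if __name__ == "__main__":
    # Positions copied from the log excerpt above.
    consistent = "13BF/D53BA0F0"
    streaming = "13C9/8C000000"
    gib = wal_distance_bytes(consistent, streaming) / 2**30
    print(f"{gib:.1f} GiB of WAL between the two log entries")
```

For the two positions above this comes to roughly 39 GiB of WAL between the point where a consistent state was reached and the point where streaming started.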
After that, if I try to connect to the slave, I get:
FATAL: the database system is starting up
This happens no matter how long I wait (I tried again more than a day later).
During that time, the master keeps streaming WAL to the
slave.
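One way to confirm this state from the log alone is to check which startup milestones the standby has logged. The sketch below is an illustration, not a repmgr or PostgreSQL tool: the function name `standby_milestones` is hypothetical, and the "ready to accept read only connections" wording is an assumption about the message a hot standby normally logs once it opens for queries (the pattern accepts both "read only" and "read-only").

```python
import re

# Assumption: these patterns match the wording of the server's LOG messages.
CONSISTENT = re.compile(r"consistent recovery state reached")
STREAMING = re.compile(r"started streaming WAL from primary")
READY = re.compile(r"database system is ready to accept read[- ]only connections")


def standby_milestones(log_text: str) -> dict:
    """Report which standby startup milestones appear in a log excerpt."""
    # Collapse whitespace so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    return {
        "consistent": bool(CONSISTENT.search(flat)),
        "streaming": bool(STREAMING.search(flat)),
        "ready": bool(READY.search(flat)),
    }


if __name__ == "__main__":
    excerpt = """
    LOG: entering standby mode
    LOG: redo starts at 13BF/CF000028
    LOG: consistent recovery state reached at 13BF/D53BA0F0
    LOG: started streaming WAL from primary at 13C9/8C000000 on timeline 1
    """
    print(standby_milestones(excerpt))
```

A log that shows "consistent" and "streaming" but never "ready" matches the symptom described here: replication is running, yet connections are still refused.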
Notes:
That last log example was taken after trying to clone from
our barman server (I tried both with and without it).
use_replication_slots is set to yes.
hot_standby is on on the primary, so it is also on on the
slave once cloned.
Before one of my clone attempts, I tried cleaning up all
repmgr residue, i.e. removing the extension, re-registering the
master, etc. Still the same issue.
If I comment out hot_standby on the slave, it starts
normally, but still doesn’t allow connections.
recovery.conf is:
standby_mode = 'on'
primary_conninfo = 'host=MASTERIP user=repmgr application_name=SLAVENAME'
recovery_target_timeline = 'latest'
primary_slot_name = 'repmgr_slot_1'
Any help troubleshooting this would be appreciated!