Slave stuck in recovery mode

From: "Nicolas Ross" <rossnick-lists(at)cybercat(dot)net>
To: pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Slave stuck in recovery mode
Date: 2021-10-08 23:15:55
Message-ID: web-254322712@mail.cybercat.ca
Lists: pgsql-admin

Hi!

We’ve been using PostgreSQL for some time now (since the 9.3 days).

I’ve got a PostgreSQL 9.6 cluster with 2 nodes, a primary and a slave. We use repmgr to manage the cluster. When it was installed, it was something like repmgr 4.x or even 3.x.

This week, for some reason, I had to rebuild the slave instance. So I re-cloned the slave with a command like:

/usr/pgsql-9.6/bin/repmgr -h pgserver2.qualite -U repmgr -f /etc/repmgr/9.6/repmgr.conf standby clone

After some time (it’s about 250 GB, so it takes an hour or two), the command finishes.

If I start the PostgreSQL server on the slave with the OS systemctl script, the command doesn’t return to the shell prompt (presumably it is waiting for something).

In the log I see:

< 2021-10-08 16:16:47.861 EDT > LOG: database system was shut down in recovery at 2021-10-08 16:04:10 EDT
< 2021-10-08 16:16:47.877 EDT > LOG: entering standby mode
< 2021-10-08 16:16:48.599 EDT > LOG: redo starts at 13BF/CF000028
< 2021-10-08 16:16:52.899 EDT > LOG: consistent recovery state reached at 13BF/D53BA0F0
(Some time passes)
< 2021-10-08 16:46:10.363 EDT > LOG: started streaming WAL from primary at 13C9/8C000000 on timeline 1

After that, if I try to connect to the slave, I get:

FATAL: the database system is starting up

This happens no matter how long I wait (I tried again more than a day later).

During that time, the master keeps streaming WAL to the slave.
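
To put the log positions in perspective, the LSNs above can be converted into byte distances. The sketch below is an illustrative Python helper (not part of repmgr or PostgreSQL). It assumes the standard LSN text format, where the hex value before the slash is the high 32 bits of the WAL position and the value after it is the low 32 bits. On the 9.6 server itself, pg_xlog_location_diff() computes the same difference.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '13BF/CF000028' to an absolute byte position.

    The part before the slash is the high 32 bits and the part after it
    is the low 32 bits, both written in hexadecimal.
    """
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_distance(start: str, end: str) -> int:
    """Number of WAL bytes from LSN `start` to LSN `end`."""
    return lsn_to_bytes(end) - lsn_to_bytes(start)


# LSNs copied from the standby log in this message.
redo_start = "13BF/CF000028"
consistent = "13BF/D53BA0F0"
streaming_start = "13C9/8C000000"

for label, a, b in [
    ("redo start -> consistent state", redo_start, consistent),
    ("consistent state -> streaming start", consistent, streaming_start),
]:
    n = wal_distance(a, b)
    print(f"{label}: {n} bytes ({n / 2**30:.2f} GiB)")
```

This shows how much WAL the standby replayed locally between reaching a consistent state and switching to streaming from the primary (about half an hour in the log timestamps).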

Notes:

That last log excerpt was taken after trying to clone from our barman server (I tried both with and without it).

use_replication_slots is set to yes.

hot_standby is on on the primary, so it is also on in the cloned slave.

Before one of my clone attempts, I tried cleaning up all residue of repmgr, i.e. removing the extension, re-registering the master, etc.; still the same issue.

If I comment out hot_standby on the slave, it starts
normally, but still doesn’t allow connections.

recovery.conf is:

standby_mode = 'on'
primary_conninfo = 'host=MASTERIP user=repmgr application_name=SLAVENAME'
recovery_target_timeline = 'latest'
primary_slot_name = 'repmgr_slot_1'

Any help troubleshooting this would be appreciated!
