Postgres HA issue - Standby server does not start after Master compute host is shut down

From: Nagendra Bandi <nagendra(dot)bandi(at)gmail(dot)com>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Postgres HA issue - Standby server does not start after Master compute host is shut down
Date: 2019-09-03 14:57:37
Message-ID: CAO8W-N9ZPacOAJ4PqaPxM7KCEMca-4LbQBOd1_iXJ=o6gJWvCg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Dear community,
Looking for help on HA issue. If you could provide any hints on what I am
missing that would be greatly appreciated.

*Problem Description :*

PostgreSQL database is set up for High Availability and *asynchronous
streaming replication* with *hot standby* as described in
https://www.postgresql.org/docs/9.4/warm-standby.html#STREAMING-REPLICATION.
There is only 1 master and 1 standby server set up. The deployment is on
Openstack cloud where master database is running on one compute host and
standby server is running on another compute host. When master is brought
down (*compute host is powered off*), stand by is *not* able to take over.
The PostgreSQL database server on the standby fails to start with errors
messages given below.

*Error logs from Standby Postgres DB:*

*Logs form postgres log :
*< 2019-08-22 16:00:04.110 UTC >LOG: invalid record length at C/513BF150

< 2019-08-22 16:00:04.110 UTC >LOG: redo done at C/513BF118

< 2019-08-22 16:00:04.115 UTC >FATAL: WAL ends before end of online backup

< 2019-08-22 16:00:04.115 UTC >HINT: All WAL generated while online backup
was taken must be available at recovery.

< 2019-08-22 16:00:04.116 UTC >LOG: startup process (PID 10450) exited with
exit code 1

< 2019-08-22 16:00:04.116 UTC >LOG: aborting startup due to startup process
failure

< 2019-08-22 16:00:06.692 UTC >LOG: database system was interrupted while
in recovery at 2019-08-22 16:00:04 UTC

< 2019-08-22 16:00:06.692 UTC >HINT: This probably means that some data is
corrupted and you will have to use the last backup for recovery.

< 2019-08-22 16:00:06.777 UTC >LOG: database system was not properly shut
down; automatic recovery in progress

< 2019-08-22 16:00:06.780 UTC >LOG: redo starts at C/513BF078

< 2019-08-22 16:00:06.780 UTC >LOG: invalid record length at C/513BF150

< 2019-08-22 16:00:06.780 UTC >LOG: redo done at C/513BF118

< 2019-08-22 16:00:06.783 UTC >FATAL: WAL ends before end of online backup

< 2019-08-22 16:00:06.783 UTC >HINT: All WAL generated while online backup
was taken must be available at recovery.

< 2019-08-22 16:00:06.785 UTC >LOG: startup process (PID 10645) exited with
exit code 1

< 2019-08-22 16:00:06.785 UTC >LOG: aborting startup due to startup process
failure

*NOTE :*

This issue is seen only when "compute host" is brought down with database
size of about 2GB. Tests are repeated few times and issue is
reproducible. *However,
issue is NOT seen when VM is brought down. i.e. When master is brought down
(by making respective "VM" is powered off), standby is able to take over.*

*Alternatives Tried :*
Changed the streaming replication from asynchronous mode to synchronous
mode (as described in
https://www.postgresql.org/docs/9.4/warm-standby.html#SYNCHRONOUS-REPLICATION)
but even this alternative gave the same erroneous behaviour. Logs attached
for this sync case as well.

*Version Info :*

POSTGRES Version : 9.4. Community edition.

OS Version: RHEL 7.2

Openstack version: Newton

*Config Files:*

master_postgresql.conf : postgresql.conf on the Master.

standby_postgresql.conf : postgresql.conf on the Standby.

standby_recovery.conf : recovery.conf on the Standby.

*Log Files :*

Log File name for async replication test case:
20190901_102820_pg_standby_debug_Async_replication.log

Description:

a. Enabled debug on Postgres configuration by setting “*log_min_messages
= debug5*”.

b. The time when the test case is executed is “*2019-09-01 10:28:20*”,
when the trigger file is found by Postgres.

Log File name for synchronous replication test case:
20190901_154356_pg_standby_debug_synchronous_replication.log

Description:

a) Enabled debug on Postgres configuration by setting “log_min_messages
= debug5”.

b) The time when test case is executed is “*2019-09-01 15:43:56*”, when
the trigger file is found by Postgres.

Appreciate your help.

Regards,
Nagendra

Attachment Content-Type Size
standby_postgresql.conf application/octet-stream 28.4 KB
standby_recovery.conf application/octet-stream 4.9 KB
master_postgresql.conf application/octet-stream 28.4 KB
20190901_154356_pg_standby_debug_synchronous_replication.log application/octet-stream 185.6 KB
20190901_102820_pg_standby_debug_Async_replication.log application/octet-stream 1.2 MB

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Imre Samu 2019-09-03 15:02:41 Re: Slow statement using parallelism after 9.6>11 upgrade
Previous Message Paul Ramsey 2019-09-03 14:39:41 Re: Slow statement using parallelism after 9.6>11 upgrade