Quick Links

Hot Standby Replica Recovery Problem

From:	Michael Harris <michael(dot)harris(at)ericsson(dot)com>
To:	"pgsql-general(at)lists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject:	Hot Standby Replica Recovery Problem
Date:	2018-12-14 07:40:20
Message-ID:	AM3PR07MB3710BA2C4D59043F79469D6F3A10@AM3PR07MB371.eurprd07.prod.outlook.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

Hello Experts,

We've been having a strange problem with one of our databases.

A summary of the setup follows:

- We are running postgresql 9.6.9 on Centos 7.
- We are using postgresql native streaming replication
- There is one master and one hot standby
- The master is archiving it's WAL files with this archive command:

archive_command = 'test ! -f %p || (test ! -f /var/backup2/dbs1a/pg_xlog/%f.gz && pigz -c %p > /var/backup2/dbs1a/pg_xlog/%f.gz)' # command to use to archive a logfile segment

/var/backup2 is a shared gfs2 filesystem.

- Recently we had to do some maintenance that involved rebuilding the database servers.
At that point dbs1b was the master, so we promoted dbs1a, did the work on dbs1b,
and then began establishing a new replica on dbs1b using pg_basebackup

- After pg_basebackup completed we set up recovery.conf:

restore_command='gzip -dc /var/backup2/dbs1a/pg_xlog/%f.gz > %p'
standby_mode='on'
primary_conninfo='host=dbs1a user=xxxxxx password=xxxxx'

and then started postgresql.

At first it seemed to be starting up as per normal. It began REDO-ing WAL files from
the archive. It reached a certain point and then failed with this:

[2018-12-14 11:07:39 AEDT] 2050@ LOG: restored log file "000000030002C4D700000028" from archive
[2018-12-14 11:07:39 AEDT] 29240@ LOG: fetching timeline history file for timeline 3 from primary server
[2018-12-14 11:07:39 AEDT] 29240@ FATAL: could not receive timeline history file from the primary server: ERROR: could not open file "pg_xlog/00000003.history": No such file or directory

These messages repeated every few seconds until we stopped the postmaster.

I was surprised because we have done this procedure many times on this & other databases, but anyway
Dr Google suggested this https://superuser.com/questions/1127360/cant-find-the-timeline-history-file-to-get-the-replication-working.
After reading that and other linked articles I decided to create a 'dummy' history file.

This time the startup sequence failed like this:

[2018-12-14 16:26:46 AEDT] 16575@ LOG: restored log file "000000030002C4D700000028" from archive
[2018-12-14 16:26:47 AEDT] 17669@ LOG: fetching timeline history file for timeline 3 from primary server
[2018-12-14 16:26:47 AEDT] 17669@ LOG: started streaming WAL from primary at 2C4D7/28000000 on timeline 3
[2018-12-14 16:26:47 AEDT] 17669@ FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000030002C4D700000028 has already been removed

Again I was surprised - why wasn't it getting this file from the archives? It is present and does not
seem to be truncated or corrupted (as far as I can tell).

Anyway in an attempt to force it to get the file from the archives, I removed the primary_conninfo.
This time it just repeatedly fetched the file 000000030002C4D700000028 again and again, never progressing
to the next one.

Finally I turned up the debugging output, to see this:

[2018-12-14 17:58:25 AEDT] 12365@ DEBUG: executing restore command "gzip -dc /var/backup2/dbs1a/pg_xlog/000000030002C4D700000028.gz > pg_xlog/RECOVERYXLOG"
[2018-12-14 17:58:25 AEDT] 12365@ LOG: restored log file "000000030002C4D700000028" from archive
[2018-12-14 17:58:25 AEDT] 12365@ DEBUG: got WAL segment from archive
[2018-12-14 17:58:25 AEDT] 12365@ DEBUG: switched WAL source from archive to stream after failure
[2018-12-14 17:58:25 AEDT] 12365@ DEBUG: switched WAL source from stream to archive after failure

It seems like something is going wrong with the WAL file replay here, although I cannot see what??

I'm writing this email after the second attempt at replication: the first attempt ended much the same.
I assumed I had done something wrong so I re-started from the pg_basebackup. Now I am concerned
that something is wrong with the master, or maybe with our WAL archiving process.

Any tips would be gratefully received!

Regards
Mike

Responses

RE: Hot Standby Replica Recovery Problem at 2018-12-24 11:04:03 from Michael Harris

Browse pgsql-general by date

	From	Date	Subject
Next Message	Oleg	2018-12-14 11:33:38	libpq PQexecParams & value placeholders
Previous Message	Guillaume Lelarge	2018-12-14 07:04:56	Re: Installing pg_buffercache (was Re: shared_buffers on Big RAM systems)