Missing WAL file after running pg_rewind

From: Dylan Luong <Dylan(dot)Luong(at)unisa(dot)edu(dot)au>
To: "pgsql-general(at)lists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Missing WAL file after running pg_rewind
Date: 2018-01-11 16:58:02
Message-ID: ab82d7fd35ef4394bc5dfc6a6e2f1266@ITUPW-EXMBOX3B.UniNet.unisa.edu.au
Lists: pgsql-general

Hi

We had a failover situation where our monitoring watchdog processes promoted the slave to become the new master.
I restarted the old master database to ensure a clean stop/start, then ran pg_rewind on the old master to resync it with the new master. However, after a successful rewind, the new slave failed to restart.
The steps I took were:

1. Stop all watchdogs

2. Start/stop the old master

3. Run 'checkpoint' on new master

4. Run the pg_rewind on old master to resync with new master

5. Start the old master (as new slave)
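
For readers following along, steps 2 to 5 roughly correspond to the commands built below. This is only a sketch under stated assumptions: the data directory and connection string are hypothetical placeholders, the function name `resync_commands` is made up for illustration, and the exact options used in this incident are not known. Step 1 (stopping the watchdogs) depends on site-specific tooling and is left out. The script only prints the commands; it does not run them.

```python
import shlex


def resync_commands(old_master_pgdata, new_master_conninfo):
    """Build (but do not run) the command lines for steps 2 to 5.

    Both arguments are hypothetical placeholders, not values from this thread.
    """
    return [
        # Step 2: start and then cleanly stop the old master, so that
        # pg_rewind sees a cleanly shut down target cluster.
        ["pg_ctl", "start", "-D", old_master_pgdata, "-w"],
        ["pg_ctl", "stop", "-D", old_master_pgdata, "-m", "fast", "-w"],
        # Step 3: force a checkpoint on the new master.
        ["psql", "-d", new_master_conninfo, "-c", "CHECKPOINT"],
        # Step 4: rewind the old master against the new master.
        [
            "pg_rewind",
            "--target-pgdata=" + old_master_pgdata,
            "--source-server=" + new_master_conninfo,
        ],
        # Step 5: start the old master again, now as the new slave
        # (its recovery settings must already point at the new master).
        ["pg_ctl", "start", "-D", old_master_pgdata],
    ]


# Hypothetical example values, for illustration only.
for cmd in resync_commands("/path/to/pgdata", "host=new-master dbname=postgres"):
    print(" ".join(shlex.quote(arg) for arg in cmd))
```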

Step 4 (pg_rewind) was successful, and the new slave was rewound to the same timeline as the new master. However, when I restarted the new slave, it failed to start with the following errors:

80) FATAL: the database system is starting up
cp: cannot stat '/pg_backup/backup/archive_sync/0000000400000383000000BF': No such file or directory
cp: cannot stat '/pg_backup/backup/archive_sync/0000000300000383000000BF': No such file or directory
cp: cannot stat '/pg_backup/backup/archive_sync/0000000200000383000000BF': No such file or directory
cp: cannot stat '/pg_backup/backup/archive_sync/0000000100000383000000BF': No such file or directory
2018-01-11 23:21:59 ACDT [112235]: [1-1] db=,user= app=,host= LOG: started streaming WAL from primary at 383/BE000000 on timeline 6
2018-01-11 23:21:59 ACDT [112235]: [2-1] db=,user= app=,host= FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000600000383000000BE has already been removed
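
The segment name in the error follows directly from the timeline and the streaming start position in the log line before it. The sketch below shows the mapping, assuming the default 16 MB WAL segment size (which matches the 16777216-byte files in the listings further down). The helper names `wal_segment_name` and `parse_wal_segment_name` are made up for illustration and are not PostgreSQL functions.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16 MB segments


def wal_segment_name(timeline, lsn, segment_size=WAL_SEGMENT_SIZE):
    """Return the 24-character WAL file name containing the given LSN.

    `lsn` is in the textual form used in the logs, e.g. "383/BE000000".
    """
    high_hex, low_hex = lsn.split("/")
    high = int(high_hex, 16)  # upper 32 bits of the LSN
    low = int(low_hex, 16)    # lower 32 bits of the LSN
    segments_per_log = 0x100000000 // segment_size
    segno = high * segments_per_log + low // segment_size
    return "%08X%08X%08X" % (
        timeline,
        segno // segments_per_log,
        segno % segments_per_log,
    )


def parse_wal_segment_name(name):
    """Split a 24-character WAL file name into (timeline, log, segment)."""
    return int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16)


print(wal_segment_name(6, "383/BE000000"))
# prints 0000000600000383000000BE
print(parse_wal_segment_name("0000000600000383000000BE"))
# prints (6, 899, 190)
```

So streaming from 383/BE000000 on timeline 6 needs segment 0000000600000383000000BE, which is the file the new master reports as already removed.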

I checked both the archive and pg_xlog directories on the new master and cannot locate the missing file.

Has anyone experienced this before with pg_rewind?

The earliest WAL files in the archive directory date from just after the failover occurred.

E.g., in the archive directory on the new master:
$ ls -l
total 15745032
-rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000500000383000000C0.partial
-rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000600000383000000C0
-rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000600000383000000C1
-rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000600000383000000C2
-rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000600000383000000C

And in the pg_xlog directory on the new master:
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000080
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000081
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000082
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000083
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000084
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000085
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000086
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000087
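
The gap can be checked mechanically by comparing the requested segment against the file names in the two listings above. This is a minimal sketch: the helper names are made up for illustration, and the lists contain only the names shown in this message (the real directories may hold more files). Names that are not exactly 24 hex digits, such as the `.partial` file and the cut-off last archive entry, are ignored.

```python
import re

# A complete WAL segment name is exactly 24 uppercase hex digits.
SEGMENT_RE = re.compile(r"^[0-9A-F]{24}$")


def complete_segments(names):
    """Keep only well-formed segment names, sorted oldest first."""
    return sorted(n for n in names if SEGMENT_RE.match(n))


def oldest_on_timeline(names, timeline):
    """Return the oldest complete segment on a timeline, or None."""
    prefix = "%08X" % timeline
    candidates = [n for n in complete_segments(names) if n.startswith(prefix)]
    return candidates[0] if candidates else None


# File names as they appear in the listings above.
archive = [
    "0000000500000383000000C0.partial",
    "0000000600000383000000C0",
    "0000000600000383000000C1",
    "0000000600000383000000C2",
    "0000000600000383000000C",  # cut off in the listing, so ignored
]
pg_xlog = ["00000006000003850000008%d" % i for i in range(8)]

needed = "0000000600000383000000BE"  # from the streaming error
available = complete_segments(archive + pg_xlog)

print(needed in available)               # prints False
print(oldest_on_timeline(available, 6))  # prints 0000000600000383000000C0
```

Among the files shown, the oldest complete timeline 6 segment is 0000000600000383000000C0, two segments after the requested 0000000600000383000000BE.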

Thanks
Dylan
