From: | Dylan Luong <Dylan(dot)Luong(at)unisa(dot)edu(dot)au> |
---|---|
To: | "pgsql-general(at)lists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org> |
Subject: | Missing WAL file after running pg_rewind |
Date: | 2018-01-11 16:58:02 |
Message-ID: | ab82d7fd35ef4394bc5dfc6a6e2f1266@ITUPW-EXMBOX3B.UniNet.unisa.edu.au |
Lists: | pgsql-general |
Hi
We had a failover situation where our monitoring watchdog processes promoted the slave to become the new master.
I restarted the old master database to ensure a clean stop/start, then ran pg_rewind on the old master to resync it with the new master. However, after a successful rewind, the new slave failed to start.
The steps I took were:
1. Stop all watchdogs
2. Start/stop the old master
3. Run 'checkpoint' on new master
4. Run the pg_rewind on old master to resync with new master
5. Start the old master (as new slave)
Step 4 (pg_rewind) succeeded, and the new slave was rewound onto the same timeline as the new master. However, when I restarted the new slave, it failed to start with the following errors:
80) FATAL: the database system is starting up
cp: cannot stat '/pg_backup/backup/archive_sync/0000000400000383000000BF': No such file or directory
cp: cannot stat '/pg_backup/backup/archive_sync/0000000300000383000000BF': No such file or directory
cp: cannot stat '/pg_backup/backup/archive_sync/0000000200000383000000BF': No such file or directory
cp: cannot stat '/pg_backup/backup/archive_sync/0000000100000383000000BF': No such file or directory
2018-01-11 23:21:59 ACDT [112235]: [1-1] db=,user= app=,host= LOG: started streaming WAL from primary at 383/BE000000 on timeline 6
2018-01-11 23:21:59 ACDT [112235]: [2-1] db=,user= app=,host= FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000600000383000000BE has already been removed
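To illustrate how the requested segment name follows from the streaming start position in the log above, here is a minimal Python sketch. It assumes the default 16 MB WAL segment size (consistent with the 16777216-byte files listed further down); `wal_segment_name` is a hypothetical helper written for this illustration, not a PostgreSQL function.

```python
# Assumption: default 16 MB WAL segments.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def wal_segment_name(timeline: int, lsn: str,
                     segment_size: int = WAL_SEGMENT_SIZE) -> str:
    """Return the name of the WAL segment file containing `lsn` on `timeline`.

    `lsn` is written as in the server log, e.g. "383/BE000000":
    the high 32 bits and the low 32 bits, both in hex.
    """
    high_hex, low_hex = lsn.split("/")
    high = int(high_hex, 16)
    low = int(low_hex, 16)
    # The segment number within the "log" is the low word divided by the segment size.
    segment = low // segment_size
    # File name: 8 hex digits each for timeline, high word, and segment number.
    return f"{timeline:08X}{high:08X}{segment:08X}"


# Start position reported by the standby in the log above.
print(wal_segment_name(6, "383/BE000000"))
```

Under that assumption, streaming from 383/BE000000 on timeline 6 maps to the segment named in the FATAL message.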
I checked both the archive and pg_xlog directories on the new master and cannot locate the missing file.
Has anyone experienced this before with pg_rewind?
The earliest WAL files in the archive directory date from just after the failover occurred.
For example, in the archive directory on the new master:
$ ls -l
total 15745032
-rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000500000383000000C0.partial
-rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000600000383000000C0
-rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000600000383000000C1
-rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000600000383000000C2
-rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000600000383000000C
And in the pg_xlog directory on the new master:
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000080
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000081
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000082
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000083
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000084
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000085
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000086
-rw-------. 1 postgres postgres 16777216 Jan 11 18:57 000000060000038500000087
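To make the gap concrete, the following Python sketch compares the requested segment with the complete archived file names shown above. It is only an illustration under stated assumptions: 16 MB segments (so 256 segments per high word), and `parse_wal_name` and `absolute_segment` are hypothetical helpers. The truncated name in the archive listing is left out because it cannot be parsed.

```python
SEGMENTS_PER_LOG = 0x100000000 // (16 * 1024 * 1024)  # 256 with 16 MB segments


def parse_wal_name(name: str):
    """Split a WAL file name into (timeline, log, segment) integers."""
    base = name.split(".")[0]  # drop suffixes such as ".partial"
    if len(base) != 24:
        raise ValueError(f"not a WAL segment name: {name!r}")
    return int(base[0:8], 16), int(base[8:16], 16), int(base[16:24], 16)


def absolute_segment(name: str) -> int:
    """Segment position independent of timeline, for ordering."""
    _, log, seg = parse_wal_name(name)
    return log * SEGMENTS_PER_LOG + seg


requested = "0000000600000383000000BE"  # from the FATAL message
archived = [
    "0000000500000383000000C0.partial",
    "0000000600000383000000C0",
    "0000000600000383000000C1",
    "0000000600000383000000C2",
]

timeline = parse_wal_name(requested)[0]
# Only complete segments on the requested timeline can satisfy the request.
candidates = [n for n in archived
              if "." not in n and parse_wal_name(n)[0] == timeline]
earliest = min(candidates, key=absolute_segment)
missing_before = absolute_segment(earliest) - absolute_segment(requested)

print(f"earliest archived on timeline {timeline}: {earliest}")
print(f"requested segment is {missing_before} segment(s) older")
```

With the names above, the earliest timeline-6 segment in the archive is 0000000600000383000000C0, two segments after the one the slave asks for, which matches the `cp` failures for ...BF and the streaming failure for ...BE.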
Thanks
Dylan