RE: Missing WAL file after running pg_rewind

From: Dylan Luong <Dylan(dot)Luong(at)unisa(dot)edu(dot)au>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: "pgsql-general(at)lists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: RE: Missing WAL file after running pg_rewind
Date: 2018-01-12 21:44:25
Message-ID: a7ad1502b60f4e4fae8ae9e8575b8e83@ITUPW-EXMBOX3B.UniNet.unisa.edu.au
Lists: pgsql-general

The file exists in the archive directory of the old master, but it is for the previous timeline, i.e. 5 and not 6: 0000000500000383000000BE.
Can I just rename the file to timeline 6, i.e. 0000000600000383000000BE?
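The two names differ only in their first eight hex digits. Assuming the standard 24-hex-digit layout (timeline ID, then the two halves of the segment number), this minimal Python sketch splits a name into its fields. It only shows what the name encodes. It does not settle whether renaming is safe: a rename changes the file name, not the timeline ID recorded in the WAL page headers inside the file.

```python
import re
from typing import NamedTuple

# Assumed layout: 24 upper-case hex digits = timeline ID (8) + "log" (8) + segment (8).
WAL_NAME = re.compile(r"^([0-9A-F]{8})([0-9A-F]{8})([0-9A-F]{8})$")


class WalName(NamedTuple):
    timeline: int
    log: int  # high 32 bits of the WAL position
    seg: int  # segment number within that 4 GB "log"


def parse_wal_name(name: str) -> WalName:
    """Split a WAL segment file name into its three hex fields."""
    m = WAL_NAME.match(name)
    if m is None:
        raise ValueError(f"not a WAL segment name: {name!r}")
    return WalName(*(int(part, 16) for part in m.groups()))


def format_wal_name(w: WalName) -> str:
    """Inverse of parse_wal_name: rebuild the 24-character file name."""
    return f"{w.timeline:08X}{w.log:08X}{w.seg:08X}"


old = parse_wal_name("0000000500000383000000BE")
print(old)                                        # WalName(timeline=5, log=899, seg=190)
print(format_wal_name(old._replace(timeline=6)))  # 0000000600000383000000BE
```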

-----Original Message-----
From: Michael Paquier [mailto:michael(dot)paquier(at)gmail(dot)com]
Sent: Friday, 12 January 2018 12:08 PM
To: Dylan Luong <Dylan(dot)Luong(at)unisa(dot)edu(dot)au>
Cc: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Missing WAL file after running pg_rewind

On Thu, Jan 11, 2018 at 04:58:02PM +0000, Dylan Luong wrote:
> The steps I took were:
>
> 1. Stop all watchdogs
>
> 2. Start/stop the old master
>
> 3. Run 'checkpoint' on new master
>
> 4. Run the pg_rewind on old master to resync with new master
>
> 5. Start the old master (as new slave)
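The steps above can be written out as a dry-run plan. This is a minimal Python sketch that only builds and prints the command lines. The data directory and connection string are hypothetical placeholders. Step 1 is omitted because the thread does not say what the watchdogs are, and the standby configuration needed before step 5 is not shown.

```python
import shlex
from typing import List

# Hypothetical placeholders: substitute the real data directory and conninfo.
OLD_PGDATA = "/path/to/old_master/data"
NEW_MASTER = "host=new-master port=5432 user=postgres dbname=postgres"


def rewind_plan(old_pgdata: str, new_master: str) -> List[List[str]]:
    """Return steps 2-5 as argument lists; nothing is executed here."""
    return [
        # Step 2: start, then cleanly stop, the old master; pg_rewind needs a
        # target data directory that was shut down cleanly.
        ["pg_ctl", "-D", old_pgdata, "start", "-w"],
        ["pg_ctl", "-D", old_pgdata, "stop", "-m", "fast"],
        # Step 3: checkpoint the new master so that its control file records
        # the timeline it switched to at promotion.
        ["psql", new_master, "-c", "CHECKPOINT"],
        # Step 4: rewind the old master's data directory from the new master.
        ["pg_rewind", f"--target-pgdata={old_pgdata}",
         f"--source-server={new_master}"],
        # Step 5: start the old master; its standby configuration (recovery.conf
        # before PostgreSQL 12) must already be in place, which is not shown.
        ["pg_ctl", "-D", old_pgdata, "start", "-w"],
    ]


for cmd in rewind_plan(OLD_PGDATA, NEW_MASTER):
    print(shlex.join(cmd))
```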

That's a sane flow to me.

> 2018-01-11 23:21:59 ACDT [112235]: [2-1] db=,user= app=,host= FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000600000383000000BE has already been removed
>
> Has anyone experienced this before with pg_rewind?

When restarting a standby after a rewind has been done to it, note that, in order to recover to a consistent point, it needs to replay WAL from the last checkpoint before the point where WAL forked during the promotion, up to the point where the rewind finished. Per your logs, I gather that this checkpoint, the last one before the timeline jump, is located in segment 0000000X00000383000000BE, but that segment did not get archived.
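Which file holds a given WAL position is plain arithmetic. A minimal Python sketch, assuming the default 16 MB segment size; the LSN used below is hypothetical, since the thread does not give the checkpoint's actual position. On a running PostgreSQL 10+ primary, the built-in pg_walfile_name() does the same computation for the server's current timeline.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024                      # assumed default segment size
SEGMENTS_PER_XLOGID = 0x1_0000_0000 // WAL_SEGMENT_SIZE  # 256 with 16 MB segments


def parse_lsn(text: str) -> int:
    """Convert an LSN written as 'HI/LO' (two hex numbers) to a 64-bit integer."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def wal_file_name(timeline: int, lsn: int) -> str:
    """Name of the segment file that holds `lsn` on `timeline`."""
    segno = lsn // WAL_SEGMENT_SIZE
    log, seg = divmod(segno, SEGMENTS_PER_XLOGID)
    return f"{timeline:08X}{log:08X}{seg:08X}"


# Hypothetical checkpoint position; the real one is not given in the thread.
print(wal_file_name(5, parse_lsn("383/BE000028")))  # 0000000500000383000000BE
```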

> The earliest WAL files in the archive directory were from just after the failover occurred.
>
> E.g., in the archive directory on the new master:
> $ ls -l
> total 15745032
> -rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000500000383000000C0.partial
> -rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000600000383000000C0
> -rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000600000383000000C1
> -rw-------. 1 postgres postgres 16777216 Jan 11 17:52 0000000600000383000000C2

Yeah, you are looking for the WAL segment just before the last, partial WAL segment of the previous timeline. Depending on your archiving strategy, I guess that you should have set archive_mode = 'always' so that the server which was the standby before the promotion also archives the segments it receives.
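The gap can be checked mechanically against the listing. A minimal Python sketch, assuming default 16 MB segments and that the archive's file names are available as a list (here copied from the `ls -l` output quoted above). `previous_segment` and `missing_before_partial` are hypothetical helpers, not PostgreSQL tools. How far back to look depends on where the last checkpoint before the fork was; `count=2` is chosen only because the log asked for ...BE, two segments before the .partial file.

```python
from typing import List

SEGMENTS_PER_XLOGID = 256  # assumes the default 16 MB segment size

# File names copied from the `ls -l` listing quoted above.
ARCHIVE = [
    "0000000500000383000000C0.partial",
    "0000000600000383000000C0",
    "0000000600000383000000C1",
    "0000000600000383000000C2",
]


def previous_segment(name: str) -> str:
    """Segment immediately before `name` on the same timeline."""
    tli, log, seg = (int(name[i:i + 8], 16) for i in (0, 8, 16))
    log, seg = divmod(log * SEGMENTS_PER_XLOGID + seg - 1, SEGMENTS_PER_XLOGID)
    return f"{tli:08X}{log:08X}{seg:08X}"


def missing_before_partial(archive: List[str], count: int) -> List[str]:
    """Of the `count` segments preceding each .partial file, list those absent from `archive`."""
    present = set(archive)
    missing = []
    for name in archive:
        if name.endswith(".partial"):
            seg = name[:24]  # drop the ".partial" suffix
            for _ in range(count):
                seg = previous_segment(seg)
                if seg not in present:
                    missing.append(seg)
    return missing


print(missing_before_partial(ARCHIVE, 2))
# ['0000000500000383000000BF', '0000000500000383000000BE']
```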
--
Michael
