From: | Simon Riggs <simon(at)2ndquadrant(dot)com> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Richard Schmidt <Richard(dot)Schmidt(at)metservice(dot)com>, "pgsql-general(at)lists(dot)postgresql(dot)org" <pgsql-general(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Pg_rewind cannot load history wal |
Date: | 2018-08-04 06:44:59 |
Message-ID: | CANP8+j+fR79iX=39LRnrwne7Wp5xDmRyLBnmQ5=4ZUjscNDbag@mail.gmail.com |
Lists: | pgsql-general |
On 3 August 2018 at 21:59, Michael Paquier <michael(at)paquier(dot)xyz> wrote:
> On Wed, Aug 01, 2018 at 09:09:30PM +0000, Richard Schmidt wrote:
>> Our procedure that runs on machines A and B is as follows:
>>
>> 1. Build new databases on A and B, and configure A as the primary and B
>> as the standby.
>> 2. Make some changes on A (the primary) and check that they are
>> replicated to B (the standby).
>> 3. Promote B to be the new primary.
>> 4. Switch off A (the original primary).
>> 5. Add the replication slot to B (the new primary) for A (soon to
>> be the standby).
>> 6. Add a recovery.conf to A (soon to be the standby). The file contains
>> recovery_target_timeline = 'latest' and restore_command = 'cp
>> /ice-dev/wal_archive/%f "%p"'.
>> 7. Run pg_rewind on A. This appears to work, as it returns the
>> messages 'source and target cluster are on the same timeline' and
>> 'no rewind required'.
>> 8. Start up server A (now a slave).
>
> Step 7 is incorrect here: after the promotion of B, you should see pg_rewind
> actually do its work. The problem is that your flow is missing a piece,
> namely a checkpoint on the promoted standby, which needs to run
> after step 3 and before step 7. This makes the promoted standby update its
> timeline number in the on-disk control file, which pg_rewind uses
> to check whether a rewind needs to happen.
>
> We see too many reports of such mistakes, so I am going to propose a patch
> on the -hackers mailing list to mention this in the documentation...

I think the problem is that the online checkpoint is deferred
after promotion, so this is a timing issue that probably doesn't show up
in our regression tests.

It sounds like we should write a pending timeline change to the control
file and have pg_rewind check that instead.

I'd call this a timing bug, not a doc issue.
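
To make the timing issue concrete, here is a toy Python model, a sketch under stated assumptions rather than PostgreSQL's code. It assumes that promotion moves the server to a new timeline immediately, while the timeline recorded in the control file changes only at the next checkpoint. The class `Cluster`, its fields and `same_timeline()` are hypothetical names invented for illustration; the check roughly mirrors pg_rewind's first test, which compares the timelines recorded by the last checkpoint in the two control files (when they differ, the real tool goes on to look for the divergence point).

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    """Hypothetical toy model of one server; not PostgreSQL's real structures."""
    current_tli: int     # timeline the server is currently writing WAL on
    checkpoint_tli: int  # timeline recorded by the last checkpoint in the control file

    def promote(self) -> None:
        # Promotion switches to a new timeline at once, but in this model the
        # checkpoint that records it in the control file has not happened yet.
        self.current_tli += 1

    def checkpoint(self) -> None:
        # A checkpoint records the current timeline in the control file.
        self.checkpoint_tli = self.current_tli


def same_timeline(source: Cluster, target: Cluster) -> bool:
    """Simplified pg_rewind-style check: looks only at the control files."""
    return source.checkpoint_tli == target.checkpoint_tli


a = Cluster(current_tli=1, checkpoint_tli=1)  # old primary, shut down
b = Cluster(current_tli=1, checkpoint_tli=1)  # standby
b.promote()
print(same_timeline(b, a))  # True: looks like "no rewind required"
b.checkpoint()              # e.g. an explicit CHECKPOINT on B
print(same_timeline(b, a))  # False: timelines differ, so a rewind is considered
```

In this model, the explicit checkpoint between steps 3 and 7 is what makes the two control files disagree; writing the pending timeline change at promotion time, as proposed above, would remove that dependency on timing.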
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services