From: John Lumby <johnlumby(at)hotmail(dot)com>
To: pgsql general <pgsql-general(at)postgresql(dot)org>
Cc: "Shulgin, Oleksandr" <oleksandr(dot)shulgin(at)zalando(dot)de>, "michael(dot)paquier(at)gmail(dot)com" <michael(dot)paquier(at)gmail(dot)com>, "adrian(dot)klaver(at)aklaver(dot)com" <adrian(dot)klaver(at)aklaver(dot)com>
Subject: Re: how to switch old replication Master to new Standby after promoting old Standby - pg_rewind log file missing
Date: 2016-03-15 16:14:13
Message-ID: COL131-W91DE11587CD443E0847F6DA3890@phx.gbl
Lists: pgsql-general
I have already hit a different problem with pg_rewind and would like to check my understanding.
The problem is that, on 9.5.1, pg_rewind fails to find a log file that it thinks it needs:
pg_rewind -D ${pg_cluster_dir} --source-server='host=10.19.0.1 port=5432 xxx' -P --debug
_____________________________________________________
connected to server
fetched file "global/pg_control", length 8192
fetched file "pg_xlog/00000003.history", length 85
servers diverged at WAL position 0/2B000060 on timeline 2
could not open file "/mnt/septcomn/sysbuild/pgrepl95/pg_xlog/00000002000000000000002A": No such file or directory
could not find previous WAL record at 0/2A0034C8
Failure, exiting
______________________________________________________
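For anyone wanting to verify the same thing, a quick check on each machine is enough
(using the same cluster directory as in the pg_rewind command above):

ls -l ${pg_cluster_dir}/pg_xlog/00000002000000000000002A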
Well, my postgresql servers have archiving turned off (I don't need archiving, only replication),
and that particular file, 00000002000000000000002A, is no longer present on either the new Primary or this old Primary,
so it is gone for good.
But my question is: given that the divergence point was 0/2B000060,
why is pg_rewind looking for a file earlier than that?
Ah well ...
By the way, this scenario was on a system which started life as a Standby (running ok),
was promoted to new Primary (ok) while the original Primary was recycled as the new Standby (ok),
so one complete failover completed ok. Just now I tried to flip them back to their original roles -
the current Standby was promoted (ok), but now I have hit this failure trying to recycle the current Primary back to a Standby.
I don't know if this is significant - i.e. my second question is:
is it more likely that a needed(?) log file goes missing when performing the second failover in a sequence of flips?
pg_controldata output, if relevant:
From the current, just-promoted Primary:
pg_control version number: 942
Catalog version number: 201510051
Database system identifier: 6261999204763226337
Database cluster state: in production
pg_control last modified: Tue 15 Mar 2016 11:43:26 AM EDT
Latest checkpoint location: 0/33000098
Prior checkpoint location: 0/32000098
Latest checkpoint's REDO location: 0/33000060
Latest checkpoint's REDO WAL file: 000000030000000000000033
Latest checkpoint's TimeLineID: 3
Latest checkpoint's PrevTimeLineID: 3
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/1624
Latest checkpoint's NextOID: 17820
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 616
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 1624
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint: Tue 15 Mar 2016 11:43:26 AM EDT
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: hot_standby
wal_log_hints setting: on
max_connections setting: 100
max_worker_processes setting: 8
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0
From the previous Primary, which I am trying and failing to rewind:
pg_control version number: 942
Catalog version number: 201510051
Database system identifier: 6261999204763226337
Database cluster state: shut down
pg_control last modified: Tue 15 Mar 2016 11:34:36 AM EDT
Latest checkpoint location: 0/32000108
Prior checkpoint location: 0/32000098
Latest checkpoint's REDO location: 0/32000108
Latest checkpoint's REDO WAL file: 000000020000000000000032
Latest checkpoint's TimeLineID: 2
Latest checkpoint's PrevTimeLineID: 2
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0/1624
Latest checkpoint's NextOID: 17820
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 616
Latest checkpoint's oldestXID's DB: 1
Latest checkpoint's oldestActiveXID: 0
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 1
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint: Tue 15 Mar 2016 11:34:36 AM EDT
Fake LSN counter for unlogged rels: 0/1
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: hot_standby
wal_log_hints setting: on
max_connections setting: 100
max_worker_processes setting: 8
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value
Data page checksum version: 0
Cheers, John
----------------------------------------
> Subject: Re: [GENERAL] Re: how to switch old replication Master to new Standby after promoting old Standby
> To: johnlumby(at)hotmail(dot)com; michael(dot)paquier(at)gmail(dot)com
> CC: oleksandr(dot)shulgin(at)zalando(dot)de; pgsql-general(at)postgresql(dot)org
> From: adrian(dot)klaver(at)aklaver(dot)com
> Date: Tue, 15 Mar 2016 07:05:43 -0700
>
> On 03/15/2016 06:54 AM, John Lumby wrote:
>>
>> If anyone who has such privileges would like to edit the page,
>> here is what I would add to it -- feel free to edit/rewrite
>>
>> after the bullet
>>
>> . How to restart streaming replication after failover
>>
>> and before the sub-bullet
>>
>> Repeat the operations from 6th;
>>
>> add this:
>>
>> Starting with the old Standby now running as Unreplicated and the old Primary shut down but serviceable,
>> with its databases intact, the task is to put the old Primary into Standby mode
>> as rapidly and unintrusively as possible.
>> This implies not having to shut the new Primary down and not having to take another full base backup.
>> A utility named pg_rewind makes this much simpler and more robust; it is included in the standard
>> postgresql distribution since 9.5 and is documented under PostgreSQL Server Applications.
>>
>> To use pg_rewind:
>> First and most important, it is essential to have *previously* set the configuration parameter
>> wal_log_hints = on
>> in both the old Primary and the old Standby, *before* the failover.
>> An alternative is described in the documentation, but setting this parameter is simpler.
>> If you did not set this parameter or the alternative, then set it now for the future,
>> and don't use pg_rewind this time. See next.
>> Secondly, note that pg_rewind will potentially update *every* file in the old Primary cluster,
>> including configuration files. The configuration files quite likely do not match exactly on the two systems,
>> so make a copy of postgresql.conf and postgresql.auto.conf for later restore, for example as shown below.
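>> For example (the backup location here is just a placeholder - anywhere outside the cluster directory will do):
>>
>> cp ${pg_cluster_dir}/postgresql.conf ${pg_cluster_dir}/postgresql.auto.conf /some/safe/place/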
>> Thirdly, double-check that the old Primary is shut down.
>> Now run pg_rewind on the old Primary using the form
>>
>> pg_rewind -D ${pg_cluster_dir} \
>> --source-server="host=${source_server_ip} port=${source_server_port} user=${replication_user} password=${replication_password}" \
>> -P
>>
>> You can add --debug if you want a blow-by-blow account of every change it makes.
>>
>> Now restore your configuration files, perhaps first comparing them to see what differences there were.
>> Finally, create the recovery.conf for the new Standby
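>> For illustration only, a minimal recovery.conf for 9.5 might look like the following
>> (the connection values are placeholders - reuse the same replication user and password as for pg_rewind above):
>>
>> standby_mode = 'on'
>> primary_conninfo = 'host=${source_server_ip} port=${source_server_port} user=${replication_user} password=${replication_password}'
>> recovery_target_timeline = 'latest'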
>>
>> You can now start the new Standby.
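>> e.g. (assuming pg_ctl on your PATH points at the right installation):
>>
>> pg_ctl -D ${pg_cluster_dir} start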
>>
>> There are some limitations of pg_rewind described in the documentation.
>> If you could not use it, or it failed, then treat your old Primary as an empty cluster
>> and commission it from the start as described next.
>>
>>
>> Cheers, John Lumby