Re: Interesting streaming replication issue

From: James Sewell <james(dot)sewell(at)jirotech(dot)com>
To: Gunnar Nick Bluth <gunnar(dot)bluth(at)pro-open(dot)de>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Interesting streaming replication issue
Date: 2017-07-27 06:55:44
Message-ID: CAANVwEuR_3qm2=owJDSkN2HYTS1Yk1inL0qsjJx2mGbCiY8jRQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Thu, Jul 27, 2017 at 4:41 PM, Gunnar "Nick" Bluth <
gunnar(dot)bluth(at)pro-open(dot)de> wrote:

>
> are you sure you're scp'ing from the archive, not from pg_xlog?
>

Yes:

restore_command = 'scp -o StrictHostKeyChecking=no
10.154.19.30:/archive/xlog//%f
%p'

Although you are right - that would almost make sense if I had done that!

Cheers,
James

>
> Regards,
>
> Gunnar "Nick" Bluth
>
> Am 27. Juli 2017 05:00:17 MESZ schrieb James Sewell <
> james(dot)sewell(at)jirotech(dot)com>:
>
>> Hi all,
>>
>> I've got two servers (A,B) which are part of a streaming replication
>> pair. A is the master, B is a hot standby. I'm sending archived WAL to a
>> directory on A, B is reading it via SCP.
>>
>> This all works fine normally. I'm on Redhat 7.3, running EDB 9.6.2 (I'm
>> currently working to reproduce with standard 9.6)
>>
>> We have recently seen a situation where B does not catch up when taken
>> offline for maintenance.
>>
>> When B is started we see the following in the logs:
>>
>> 2017-07-27 11:55:57 AEST [21432]: [979-1] user=,db=,client= (0:00000)LOG: restored log file "0000000C0000005A000000AA" from archive
>> 2017-07-27 11:55:58 AEST [21432]: [980-1] user=,db=,client= (0:00000)LOG: restored log file "0000000C0000005A000000AB" from archive
>> 2017-07-27 11:55:58 AEST [21432]: [981-1] user=,db=,client= (0:00000)LOG: restored log file "0000000C0000005A000000AC" from archive
>> 2017-07-27 11:55:59 AEST [21432]: [982-1] user=,db=,client= (0:00000)LOG: restored log file "0000000C0000005A000000AD" from archive
>> 2017-07-27 11:55:59 AEST [21432]: [983-1] user=,db=,client= (0:00000)LOG: restored log file "0000000C0000005A000000AE" from archive
>> 2017-07-27 11:56:00 AEST [21432]: [984-1] user=,db=,client= (0:00000)LOG: restored log file "0000000C0000005A000000AF" from archive
>> 2017-07-27 11:56:00 AEST [21432]: [985-1] user=,db=,client= (0:00000)LOG: restored log file "0000000C0000005A000000B0" from archive
>> 2017-07-27 11:56:01 AEST [21432]: [986-1] user=,db=,client= (0:00000)LOG: restored log file "0000000C0000005A000000B1" from archive
>> 2017-07-27 11:56:01 AEST [21432]: [987-1] user=,db=,client= (0:00000)LOG: restored log file "0000000C0000005A000000B2" from archive
>> 2017-07-27 11:56:02 AEST [21432]: [988-1] user=,db=,client= (0:00000)LOG: restored log file "0000000C0000005A000000B3" from archive
>> 2017-07-27 11:56:02 AEST [21432]: [989-1] user=,db=,client= (0:00000)LOG: restored log file "0000000C0000005A000000B4" from archive
>> 2017-07-27 11:56:03 AEST [21432]: [990-1] user=,db=,client= (0:00000)LOG: restored log file "0000000C0000005A000000B5" from archive
>> scp: /archive/xlog//0000000C0000005A000000B6: No such file or directory
>> 2017-07-27 11:56:03 AEST [46191]: [1-1] user=,db=,client= (0:00000)LOG: started streaming WAL from primary at 5A/B5000000 on timeline 12
>> 2017-07-27 11:56:03 AEST [46191]: [2-1] user=,db=,client= (0:XX000)FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000C0000005A000000B5 has already been removed
>>
>> scp: /archive/xlog//0000000D.history: No such file or directory
>> scp: /archive/xlog//0000000C0000005A000000B6: No such file or directory
>> 2017-07-27 11:56:04 AEST [46203]: [1-1] user=,db=,client= (0:00000)LOG: started streaming WAL from primary at 5A/B5000000 on timeline 12
>> 2017-07-27 11:56:04 AEST [46203]: [2-1] user=,db=,client= (0:XX000)FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000C0000005A000000B5 has already been removed
>>
>> This will loop indefinitely. At this stage the master reports no
>> connected standbys in pg_stat_replication, and the standby has no
>> running WAL receiver process.
>>
>> This can be 'fixed' by running pg_switch_xlog() on the master, at which
>> time a connection is seen from the standby and the logs show the following:
>>
>> scp: /archive/xlog//0000000D.history: No such file or directory
>> 2017-07-27 12:03:19 AEST [21432]: [1029-1] user=,db=,client=
>> (0:00000)LOG: restored log file "0000000C0000005A000000B5" from archive
>> scp: /archive/xlog//0000000C0000005A000000B6: No such file or directory
>> 2017-07-27 12:03:19 AEST [63141]: [1-1] user=,db=,client= (0:00000)LOG:
>> started streaming WAL from primary at 5A/B5000000 on timeline 12
>> 2017-07-27 12:03:19 AEST [63141]: [2-1] user=,db=,client=
>> (0:XX000)FATAL: could not receive data from WAL stream: ERROR: requested
>> WAL segment 0000000C0000005A000000B5 has already been removed
>>
>> scp: /archive/xlog//0000000D.history: No such file or directory
>> 2017-07-27 12:03:24 AEST [21432]: [1030-1] user=,db=,client=
>> (0:00000)LOG: restored log file "0000000C0000005A000000B5" from archive
>> 2017-07-27 12:03:24 AEST [21432]: [1031-1] user=,db=,client=
>> (0:00000)LOG: restored log file "0000000C0000005A000000B6" from archive
>> scp: /archive/xlog//0000000C0000005A000000B7: No such file or directory
>> 2017-07-27 12:03:25 AEST [21432]: [1032-1] user=,db=,client=
>> (0:00000)LOG: unexpected pageaddr 56/E7000000 in log segment
>> 0000000C0000005A000000B7, offset 0
>> 2017-07-27 12:03:25 AEST [63196]: [1-1] user=,db=,client= (0:00000)LOG:
>> started streaming WAL from primary
>>
>> At this time replication is running as normal and all changes are
>> streamed.
>>
>> Am I missing something here, this seems very odd. One thing I have
>> noticed is it only seems to be caused after a lot of WAL is produced and
>> the pg_xlog directory is sitting at max_wal_size
>>
>> James Sewell,
>> PostgreSQL Team Lead / Solutions Architect
>>
>>
>>
>> Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009
>> *P *(+61) 2 8099 9000 <(+61)%202%208099%209000> *W* www.jirotech.com
>> *F *(+61) 2 8099 9099 <(+61)%202%208099%209000>
>>
>> ------------------------------
>> The contents of this email are confidential and may be subject to legal
>> or professional privilege and copyright. No representation is made that
>> this email is free of viruses or other defects. If you have received this
>> communication in error, you may not copy or distribute any part of it or
>> otherwise disclose its contents to anyone. Please advise the sender of your
>> incorrect receipt of this correspondence.
>
>
> --
> Diese Nachricht wurde von meinem Android-Mobiltelefon mit K-9 Mail
> gesendet.
>

--

------------------------------
The contents of this email are confidential and may be subject to legal or
professional privilege and copyright. No representation is made that this
email is free of viruses or other defects. If you have received this
communication in error, you may not copy or distribute any part of it or
otherwise disclose its contents to anyone. Please advise the sender of your
incorrect receipt of this correspondence.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Nick Brennan 2017-07-27 07:11:30 Re: Indexes being ignored after upgrade to 9.5
Previous Message Gunnar "Nick" Bluth 2017-07-27 06:41:49 Re: Interesting streaming replication issue