Re: Interesting streaming replication issue

From: Andres Freund <andres(at)anarazel(dot)de>
To: James Sewell <james(dot)sewell(at)jirotech(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: Interesting streaming replication issue
Date: 2017-08-09 22:08:11
Message-ID: 20170809220811.ekhxxlhyse5mvf5c@alap3.anarazel.de
Lists: pgsql-general

Hi,

On 2017-07-27 13:00:17 +1000, James Sewell wrote:
> Hi all,
>
> I've got two servers (A,B) which are part of a streaming replication pair.
> A is the master, B is a hot standby. I'm sending archived WAL to a
> directory on A, B is reading it via SCP.
>
> This all works fine normally. I'm on Redhat 7.3, running EDB 9.6.2 (I'm
> currently working to reproduce with standard 9.6)
>
> We have recently seen a situation where B does not catch up when taken
> offline for maintenance.
>
> When B is started we see the following in the logs:
>
> 2017-07-27 11:56:03 AEST [21432]: [990-1] user=,db=,client=
> (0:00000)LOG: restored log file "0000000C0000005A000000B5" from
> archive
> scp: /archive/xlog//0000000C0000005A000000B6: No such file or directory
> 2017-07-27 11:56:03 AEST [46191]: [1-1] user=,db=,client=
> (0:00000)LOG: started streaming WAL from primary at 5A/B5000000 on
> timeline 12
> 2017-07-27 11:56:03 AEST [46191]: [2-1] user=,db=,client=
> (0:XX000)FATAL: could not receive data from WAL stream: ERROR:
> requested WAL segment 0000000C0000005A000000B5 has already been
> removed
>
> scp: /archive/xlog//0000000D.history: No such file or directory
> scp: /archive/xlog//0000000C0000005A000000B6: No such file or directory
> 2017-07-27 11:56:04 AEST [46203]: [1-1] user=,db=,client=
> (0:00000)LOG: started streaming WAL from primary at 5A/B5000000 on
> timeline 12
> 2017-07-27 11:56:04 AEST [46203]: [2-1] user=,db=,client=
> (0:XX000)FATAL: could not receive data from WAL stream: ERROR:
> requested WAL segment 0000000C0000005A000000B5 has already been
> removed
>
> This will loop indefinitely. At this stage the master reports no connected
> standbys in pg_stat_replication, and the standby has no running WAL
> receiver process.
>
> This can be 'fixed' by running pg_switch_xlog() on the master, at which
> time a connection is seen from the standby and the logs show the following:
>
> scp: /archive/xlog//0000000D.history: No such file or directory
> 2017-07-27 12:03:19 AEST [21432]: [1029-1] user=,db=,client= (0:00000)LOG:
> restored log file "0000000C0000005A000000B5" from archive
> scp: /archive/xlog//0000000C0000005A000000B6: No such file or directory
> 2017-07-27 12:03:19 AEST [63141]: [1-1] user=,db=,client= (0:00000)LOG:
> started streaming WAL from primary at 5A/B5000000 on timeline 12
> 2017-07-27 12:03:19 AEST [63141]: [2-1] user=,db=,client= (0:XX000)FATAL:
> could not receive data from WAL stream: ERROR: requested WAL segment
> 0000000C0000005A000000B5 has already been removed
>
> scp: /archive/xlog//0000000D.history: No such file or directory
> 2017-07-27 12:03:24 AEST [21432]: [1030-1] user=,db=,client= (0:00000)LOG:
> restored log file "0000000C0000005A000000B5" from archive
> 2017-07-27 12:03:24 AEST [21432]: [1031-1] user=,db=,client= (0:00000)LOG:
> restored log file "0000000C0000005A000000B6" from archive

FWIW, I don't see a bug here. Archiving on its own doesn't guarantee
that replication progresses in increments smaller than 16MB, unless you
use archive_timeout (or, as you did, switch segments manually).
Streaming replication doesn't guarantee that WAL is retained unless you
use replication slots - which you don't appear to be using. You can make
SR retain more with approximate methods like wal_keep_segments too, but
that's not a guarantee. From what I can see you're just seeing the
combination of these two limitations, because you don't use the methods
that address them (archive_timeout, replication slots and/or
wal_keep_segments).
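
To make the 16MB granularity concrete, here is a minimal sketch of how
the segment file names in the logs above map to the streaming start
position. It assumes the default 16MB segment size and the file naming
used since 9.3 (timeline, log id, segment number, 8 hex digits each).
The helper names are made up for illustration and are not part of any
PostgreSQL tool.

```python
import string

# Assumption: default 16MB WAL segments (not changed at build time).
WAL_SEGMENT_SIZE = 16 * 1024 * 1024
# Each "log id" covers 4GB of WAL, i.e. 256 segments of 16MB.
SEGMENTS_PER_LOGID = 0x100000000 // WAL_SEGMENT_SIZE


def parse_wal_segment(name):
    """Split a 24-hex-digit WAL file name into (timeline, log_id, seg)."""
    if len(name) != 24 or any(c not in string.hexdigits for c in name):
        raise ValueError("not a WAL segment file name: %r" % name)
    timeline = int(name[0:8], 16)
    log_id = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    if seg >= SEGMENTS_PER_LOGID:
        raise ValueError("segment number out of range: %r" % name)
    return timeline, log_id, seg


def segment_start_lsn(name):
    """Return the LSN of the first byte of the segment, as 'hi/lo' hex."""
    _, log_id, seg = parse_wal_segment(name)
    return "%X/%X" % (log_id, seg * WAL_SEGMENT_SIZE)


def segments_between(older, newer):
    """Number of whole segments from `older` to `newer` (same timeline)."""
    tl_a, log_a, seg_a = parse_wal_segment(older)
    tl_b, log_b, seg_b = parse_wal_segment(newer)
    if tl_a != tl_b:
        raise ValueError("segments are on different timelines")
    return (log_b * SEGMENTS_PER_LOGID + seg_b) - (log_a * SEGMENTS_PER_LOGID + seg_a)


if __name__ == "__main__":
    name = "0000000C0000005A000000B5"
    print(parse_wal_segment(name)[0])   # timeline as a decimal number
    print(segment_start_lsn(name))      # start position of that segment
```

For 0000000C0000005A000000B5 this gives timeline 12 and a start
position of 5A/B5000000, the same position the standby reports when it
starts streaming. The archive only ever moves forward one whole segment
at a time, so until the next segment is archived the standby has to get
anything newer from the primary's pg_xlog.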

Greetings,

Andres Freund
