Re: Replication failure, slave requesting old segments

From: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
To: Phil Endecott <spam_from_pgsql_lists(at)chezphil(dot)org>, pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Re: Replication failure, slave requesting old segments
Date: 2018-08-11 18:30:00
Message-ID: d5e0ad50-801d-0cf8-e9d6-7e0f7e5f6d68@aklaver.com
Lists: pgsql-general

On 08/11/2018 08:18 AM, Phil Endecott wrote:
> Dear Experts,
>
> I recently set up replication for the first time. It seemed to be
> working OK in my initial tests, but didn't cope when the slave was
> down for a longer period. This is all with the Debian stable
> packages of PostgreSQL 9.6. My replication setup involves a third
> server, "backup", to and from which WAL files are copied using scp
> (and essentially never deleted), plus streaming replication in
> "hot standby" mode.
>
> On the master, I have:
>
> wal_level = replica
> archive_mode = on
> archive_command = 'ssh backup test ! -f backup/postgresql/archivedir/%f &&
> scp %p backup:backup/postgresql/archivedir/%f'
>
> On the slave I have:
>
> standby_mode = 'on'
> primary_conninfo = 'user=postgres host=master port=5432'
> restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
>
> hot_standby = on
>
>
> This all seemed to work OK until the slave was shut down for a few days.
> (During this time the master was also down for a shorter period.)
> When the slave came back up it successfully copied several WAL files
> from the backup server until it reached one that hasn't been created yet:
>
> 2018-08-10 22:37:38.322 UTC [615] LOG: restored log file "0000000100000007000000CC" from archive
> 2018-08-10 22:38:02.011 UTC [615] LOG: restored log file "0000000100000007000000CD" from archive
> 2018-08-10 22:38:29.802 UTC [615] LOG: restored log file "0000000100000007000000CE" from archive
> 2018-08-10 22:38:55.973 UTC [615] LOG: restored log file "0000000100000007000000CF" from archive
> 2018-08-10 22:39:24.964 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
>
> At this point there was a temporary problem due to a firewall rule that
> blocked the replication connection to the master:
>
> 2018-08-10 22:41:35.865 UTC [1289] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly
>
> The slave then entered a loop, repeatedly trying to scp the next file from the backup
> archive (and failing) and to connect to the master for streaming replication
> (also failing). That is as expected, except for the "invalid magic number":
>
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-10 22:41:37.018 UTC [615] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
> 2018-08-10 22:43:46.936 UTC [1445] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly
>
> Note that it's complaining about an invalid magic number in log segment D1, which is
> the one that has just failed to scp. That looked suspicious to me and I remembered
> notes in the docs about ensuring that the archive and/or restore commands return an
> error exit status when they fail; I checked that and yes scp does exit(1) when the
> requested file doesn't exist:
>
> $ scp backup:/egheriugherg /tmp/a
> scp: /egheriugherg: No such file or directory
> $ echo $?
> 1
>
> Anyway, the slave continued in its retrying loop as expected, except that each time
> it re-fetched the previous segment, D0; this seems a bit peculiar:
>
> 2018-08-10 22:44:17.796 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-10 22:46:28.727 UTC [1628] FATAL: could not connect to the primary server: SSL connection has been closed unexpectedly
>
> (Note the "invalid magic number" was only reported the first time.)
>
> Eventually I fixed the networking issue that had prevented the replication connection
> to master and it could start streaming:
>
> 2018-08-11 00:05:50.364 UTC [615] LOG: restored log file "0000000100000007000000D0" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:05:51.325 UTC [7208] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:05:51.325 UTC [7208] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>
> So here is the real problem which worries me: it seems to ask the master for
> segment D0, which is one that it did successfully restore from the backup earlier
> on, and the master - not unexpectedly - reports that it has already been removed.
>
> I tried restarting the slave:
>
> 2018-08-11 00:12:15.536 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
> 2018-08-11 00:12:15.797 UTC [7954] LOG: redo starts at 7/D0F956C0
> 2018-08-11 00:12:16.068 UTC [7954] LOG: consistent recovery state reached at 7/D0FFF088
> 2018-08-11 00:12:16.068 UTC [7953] LOG: database system is ready to accept read only connections
> scp: backup/postgresql/archivedir/0000000100000007000000D1: No such file or directory
> 2018-08-11 00:12:17.057 UTC [7954] LOG: invalid magic number 0000 in log segment 0000000100000007000000D1, offset 0
> 2018-08-11 00:12:17.242 UTC [8034] LOG: started streaming WAL from primary at 7/D0000000 on timeline 1
> 2018-08-11 00:12:17.242 UTC [8034] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 0000000100000007000000D0 has already been removed
>
> Note that again it failed to get D1 from the backup, as expected, and then
> reports "invalid magic number" in that file, and again tries to start
> streaming from a segment that the master doesn't have.
>
> Eventually I decided to try and cause the D1 file to exist in the backup:
>
> db=> create table waste_space as select * from large_table;
>
> That caused four new WAL segments to be copied to the backup, and restored
> by the slave:
>
> 2018-08-11 00:55:49.741 UTC [7954] LOG: restored log file "0000000100000007000000D0" from archive
> 2018-08-11 00:56:12.304 UTC [7954] LOG: restored log file "0000000100000007000000D1" from archive
> 2018-08-11 00:56:35.481 UTC [7954] LOG: restored log file "0000000100000007000000D2" from archive
> 2018-08-11 00:56:57.443 UTC [7954] LOG: restored log file "0000000100000007000000D3" from archive
> 2018-08-11 00:57:21.723 UTC [7954] LOG: restored log file "0000000100000007000000D4" from archive
> scp: backup/postgresql/archivedir/0000000100000007000000D5: No such file or directory
> 2018-08-11 00:57:22.915 UTC [7954] LOG: unexpected pageaddr 7/C7000000 in log segment 0000000100000007000000D5, offset 0
> 2018-08-11 00:57:23.114 UTC [12348] LOG: started streaming WAL from primary at 7/D5000000 on timeline 1
>
> Note the "unexpected pageaddr" message. Despite that, the streaming started
> and everything was working correctly. Hurray!
>
> I remain concerned, however, that something bad happened and I'd like to
> understand it. Can anyone help?

Looks like the master recycled the WAL segments while the slave could not connect.

You might want to take a look at replication slots, which keep WAL segments
from being recycled until the standby has received them:

https://www.postgresql.org/docs/9.6/static/warm-standby.html#STREAMING-REPLICATION-SLOTS

with the heads-up that slots do not limit the disk space taken up by retained WAL.
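
For example (the slot name "standby1" here is just an illustration), on the master:

  SELECT pg_create_physical_replication_slot('standby1');

and in the slave's recovery.conf:

  primary_slot_name = 'standby1'

With that in place the master holds on to any WAL the slot still needs, at the
cost of unbounded disk usage if the standby stays away for a long time.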

The "unexpected pageaddr" is probably there for the same reason, per the
comments in xlogreader.c:

"
/*
 * Check that the address on the page agrees with what we expected.
 * This check typically fails when an old WAL segment is recycled,
 * and hasn't yet been overwritten with new data yet.
 */

...

report_invalid_record(state,
                      "unexpected pageaddr %X/%X in log segment %s, offset %u",
                      (uint32) (hdr->xlp_pageaddr >> 32), (uint32) hdr->xlp_pageaddr,
                      fname,
                      offset);
"
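
As an aside, the behaviour above leans on restore_command returning a clean
nonzero exit status when a segment is absent, which is what tells the standby
the archive is exhausted so it can fall back to streaming. A minimal
stand-alone sketch of that contract, using cp in place of the scp in the
original setup (the paths and segment names are hypothetical):

```shell
# restore_segment SEGMENT DESTINATION
# Stand-in for: restore_command = 'scp backup:backup/postgresql/archivedir/%f %p'
restore_segment() {
    seg=$1
    dest=$2
    archive=${ARCHIVE:-/tmp/archivedir}
    # cp, like scp, exits nonzero when the source file is absent;
    # PostgreSQL interprets that as "segment not in the archive yet"
    # rather than treating a partial or empty file as restored.
    cp "$archive/$seg" "$dest" 2>/dev/null
}

# Demo: a present segment restores; a missing one fails with a nonzero status.
mkdir -p /tmp/archivedir
printf 'demo' > /tmp/archivedir/0000000100000007000000D0
restore_segment 0000000100000007000000D0 /tmp/seg && echo "restored"
restore_segment 0000000100000007000000D1 /tmp/seg2 || echo "missing"
```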

>
>
> Thanks,
>
> Phil.

--
Adrian Klaver
adrian(dot)klaver(at)aklaver(dot)com
