BUG #14538: streaming replication same wal missing

From:	vodevsh(at)gmail(dot)com
To:	pgsql-bugs(at)postgresql(dot)org
Subject:	BUG #14538: streaming replication same wal missing
Date:	2017-02-09 13:58:08
Message-ID:	20170209135808.1405.57354@wrigleys.postgresql.org
Lists:	pgsql-bugs

The following bug has been logged on the website:

Bug reference: 14538
Logged by: Vladimir Svedov
Email address: vodevsh(at)gmail(dot)com
PostgreSQL version: 9.3.11
Operating system: Linux
Description:

Hi,
I posted a formatted version of this report as a question here:
http://dba.stackexchange.com/questions/163735/streaming-replication-same-wal-missing

Several days ago I saw this in the master log:

replica 2017-02-06 16:10:55 UTC ERROR: requested WAL segment 000000030000096D00000052 has already been removed
This message kept repeating until I stopped the slave.

Checking pg_stat_activity on prod showed an autovacuum task running on a
~250GB table, so I assumed it was producing a huge amount of WAL. Checking
the WAL files confirmed my assumption: all 500 segments retained by
wal_keep_segments=500 had been written within the last minute. So I concluded
that network bandwidth was not sufficient to send WAL to the slave fast
enough for it to replay. The master has archive_command = cd ., so no WAL is
archived and there was no way back.
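
For scale, here is a rough back-of-the-envelope sketch of what "500 segments within a minute" implies. It assumes the default 16 MiB WAL segment size and reads "last minute" as exactly 60 seconds; neither is stated in the report, and the function name is only illustrative.

```python
# Assumptions (not from the report): default 16 MiB WAL segments,
# and "last minute" taken literally as 60 seconds.
WAL_SEGMENT_BYTES = 16 * 1024 * 1024
WAL_KEEP_SEGMENTS = 500  # value quoted in the report


def required_bandwidth_mbit(segments: int, seconds: float,
                            segment_bytes: int = WAL_SEGMENT_BYTES) -> float:
    """Sustained rate in Mbit/s (10**6 bits) needed to ship `segments` in `seconds`."""
    return segments * segment_bytes * 8 / seconds / 1_000_000


# Total WAL retained by wal_keep_segments, in GiB.
retained_gib = WAL_KEEP_SEGMENTS * WAL_SEGMENT_BYTES / 1024 ** 3

print(f"retained WAL: {retained_gib:.2f} GiB")
print(f"needed rate:  {required_bandwidth_mbit(WAL_KEEP_SEGMENTS, 60):.0f} Mbit/s")
```

Under those assumptions the retained window is about 7.8 GiB, and keeping up with it would take a bit more than 1.1 Gbit/s sustained, which is consistent with the suspicion that the link could not keep pace.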

I did not panic; I re-created replication. I did the following:

1. delete all data files on the slave
2. pg_start_backup() on prod
3. rsync -zahv data
4. start the slave
5. pg_stop_backup() on prod and check the logs

Imagine my desperation when I saw this in the slave log:

-bash-4.2$ head -n 20 /pg/data93/pg_log/postgresql-2017-02-09_121514.log
LOG: could not open usermap file "/pg/data93/pg_ident.conf": No such file or directory
LOG: database system was shut down in recovery at 2017-02-09 12:15:07 UTC
LOG: entering standby mode
LOG: redo starts at 982/39074640
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
LOG: consistent recovery state reached at 982/3AEC60F8
LOG: unexpected pageaddr 97C/83EC8000 in log segment 00000003000009820000003A, offset 15499264
LOG: database system is ready to accept read only connections
LOG: started streaming WAL from primary at 982/3A000000 on timeline 3
ERROR: requested WAL segment 000000030000096D00000052 has already been removed
ERROR: requested WAL segment 000000030000096D00000052 has already been removed
ERROR: requested WAL segment 000000030000096D00000052 has already been removed
ERROR: requested WAL segment 000000030000096D00000052 has already been removed
ERROR: requested WAL segment 000000030000096D00000052 has already been removed
ERROR: requested WAL segment 000000030000096D00000052 has already been removed
and so on... Three days have passed! That segment is long gone, so why is it
still being requested? I copied the whole data directory.
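
To make the mismatch concrete, here is a small sketch that decodes WAL file names into the LSN they start at. It assumes the default 16 MiB segment size and the usual 24-hex-digit naming (8 digits each for timeline, log id, and segment number); the helper names are made up for illustration.

```python
SEGMENT_BYTES = 16 * 1024 * 1024  # assumption: default WAL segment size


def parse_wal_segment(name: str) -> tuple[int, str]:
    """Return (timeline, start LSN as 'hi/lo' hex) for a 24-hex-digit WAL file name."""
    if len(name) != 24:
        raise ValueError(f"expected 24 hex digits, got {name!r}")
    timeline = int(name[0:8], 16)
    log_id = int(name[8:16], 16)
    seg_no = int(name[16:24], 16)
    # The low half of the LSN is the byte offset of the segment within its log id.
    return timeline, f"{log_id:X}/{seg_no * SEGMENT_BYTES:X}"


def segment_for_lsn(timeline: int, lsn: str) -> str:
    """Return the WAL file name that contains the given 'hi/lo' hex LSN."""
    hi, lo = (int(part, 16) for part in lsn.split("/"))
    return f"{timeline:08X}{hi:08X}{lo // SEGMENT_BYTES:08X}"


# Segment named in the ERROR lines.
print(parse_wal_segment("000000030000096D00000052"))
# Segment holding the position where streaming started in the slave log.
print(segment_for_lsn(3, "982/3A000000"))
```

With these assumptions the requested segment starts at 96D/52000000 on timeline 3, while streaming started at 982/3A000000, which lives in 00000003000009820000003A (the same file named in the "unexpected pageaddr" line). So the ERROR refers to WAL far older than anything the freshly copied slave replayed.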

All pages are readable (pg_dump works on all databases), and changes from the
master reach the slave. The ERROR keeps appearing every few seconds on the
slave.

Please tell me I did something stupid and it is not a bug. It happened on
prod.

Responses

Re: BUG #14538: streaming replication same wal missing at 2017-02-10 01:48:34 from Michael Paquier
