From: | vodevsh(at)gmail(dot)com |
---|---|
To: | pgsql-bugs(at)postgresql(dot)org |
Subject: | BUG #14538: streaming replication same wal missing |
Date: | 2017-02-09 13:58:08 |
Message-ID: | 20170209135808.1405.57354@wrigleys.postgresql.org |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-bugs |
The following bug has been logged on the website:
Bug reference: 14538
Logged by: Vladimir Svedov
Email address: vodevsh(at)gmail(dot)com
PostgreSQL version: 9.3.11
Operating system: Linux
Description:
Hi,
I posted formatted text as a question
here:http://dba.stackexchange.com/questions/163735/streaming-replication-same-wal-missing
Several days ago I saw this in master log:
replica 2017-02-06 16:10:55 UTC ERROR: requested WAL segment
000000030000096D00000052 has already been removed
This message was repeating until I stopped slave.
Checking pg_stat_activity on prod showed autovacuum task on table ~250GB, I
decided that it produced huge amount of WALs. Checking them confirmed my
assumption - all wal_keep_segments=500 were last minute. So I thought that
network bandwidth was not sufficient to send WALs fast enough to slave to
replay them. Master has archive_command = cd ., so no way back.
I did not panic, I recreate replication. I:
delete all data files on slave
pg_start_backup() on prod
rsync -zahv data
pg_start slave
pg_stop_backup() on prod and check logs.
Imagine my desperation to see that slave has this:
-bash-4.2$ head -n 20 /pg/data93/pg_log/postgresql-2017-02-09_121514.log
LOG: could not open usermap file "/pg/data93/pg_ident.conf": No such file
or directory
LOG: database system was shut down in recovery at 2017-02-09 12:15:07 UTC
LOG: entering standby mode
LOG: redo starts at 982/39074640
FATAL: the database system is starting up
FATAL: the database system is starting up
FATAL: the database system is starting up
LOG: consistent recovery state reached at 982/3AEC60F8
LOG: unexpected pageaddr 97C/83EC8000 in log segment
00000003000009820000003A, offset 15499264
LOG: database system is ready to accept read only connections
LOG: started streaming WAL from primary at 982/3A000000 on timeline 3
ERROR: requested WAL segment 000000030000096D00000052 has already been
removed
ERROR: requested WAL segment 000000030000096D00000052 has already been
removed
ERROR: requested WAL segment 000000030000096D00000052 has already been
removed
ERROR: requested WAL segment 000000030000096D00000052 has already been
removed
ERROR: requested WAL segment 000000030000096D00000052 has already been
removed
ERROR: requested WAL segment 000000030000096D00000052 has already been
removed
and so on... 3 days passed! That segment has gone, but why is it looking for
it??? I copied whole data directory.
All pages are readable (pg_dump works on all dbs), changes from master go to
slave. ERROR keeps appearing each several seconds on slave.
Please, tell me I did something stupid and it is not a bug. It happened on
prod.
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2017-02-09 17:43:43 | Re: BUG #14537: Gist index irrational growth |
Previous Message | mcinp | 2017-02-09 13:25:36 | BUG #14537: Gist index irrational growth |