PG 9.3.12: Replication appears to have worked, but getting error messages in logs

From: David Caldwell <david(at)porkrind(dot)org>
To: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: PG 9.3.12: Replication appears to have worked, but getting error messages in logs
Date: 2016-04-03 05:50:12
Message-ID: 4db92cca-fd08-b015-eed8-436b1a1421cf@porkrind.org
Lists: pgsql-general

Hello,

We're using streaming replication. Our technique for spinning up a db
slave is this:

rsync from master (gross copy)
pg_start_backup() on master
rsync from master (correct copy)
pg_stop_backup() on master
drop recovery.conf into slave data directory
enable hot_standby in slave conf
start slave

After starting the slave, I'm getting this error every 5 seconds in the log:

ERROR: requested WAL segment 0000000100000E2200000005 has already been removed

But I can connect to the DB and make queries and new records are
definitely streaming in.

I thought I just didn't have enough WAL segments, so I bumped up the
number on the master and restarted the process. It just finished the
second time and the exact same error message is in the logs again (same
WAL segment number).

When I ran pg_start_backup() and pg_stop_backup(), the output was:

pg_start_backup
-----------------
E27/3100A200

pg_stop_backup
----------------
E28/7D357950

The backup_label file looked like this:

START WAL LOCATION: E27/3100A200 (file 0000000100000E2700000031)
CHECKPOINT LOCATION: E27/31C9C740
BACKUP METHOD: pg_start_backup
BACKUP FROM: master
START TIME: 2016-04-02 12:34:25 PDT
LABEL: clone
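
For reference, the file name in the START WAL LOCATION line can be derived from the location itself. The following is a minimal sketch, assuming the default 16 MB WAL segment size and timeline 1 (the 00000001 prefix seen in the file names in this message); the function name is made up for illustration.

```python
# Assumption: default 16 MB WAL segments. A server built with a
# different segment size would need a different constant here.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def lsn_to_segment_name(lsn, timeline=1):
    """Map a location such as 'E27/3100A200' to its WAL segment file name.

    The part before the slash is the high 32 bits (the "log" number);
    the part after it is the byte offset within that log.
    """
    high_hex, low_hex = lsn.split("/")
    log_id = int(high_hex, 16)
    segment = int(low_hex, 16) // WAL_SEGMENT_SIZE
    return "%08X%08X%08X" % (timeline, log_id, segment)


# Start and stop locations reported by pg_start_backup() / pg_stop_backup()
print(lsn_to_segment_name("E27/3100A200"))  # 0000000100000E2700000031
print(lsn_to_segment_name("E28/7D357950"))  # 0000000100000E280000007D
```

The first result matches the file named in the backup_label above.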

During the rsync it copied the segments from
pg_xlog/0000000100000E25000000F1
through
pg_xlog/0000000100000E2800000071

So I'm confused: why is the E22 WAL segment being requested? It seems to
predate the backup by a lot.
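
To make the comparison concrete, here is a small sketch that splits the segment names into their timeline, log and segment fields and checks the requested one against the two pg_xlog names listed above. It reads those two names as the oldest and newest segments rsync copied, which is an interpretation; the helper name is made up for illustration.

```python
from typing import Tuple


def parse_segment(name: str) -> Tuple[int, int, int]:
    """Split a 24-hex-digit WAL file name into (timeline, log, segment)."""
    if len(name) != 24:
        raise ValueError("expected 24 hex digits, got %r" % name)
    return (int(name[0:8], 16), int(name[8:16], 16), int(name[16:24], 16))


requested = parse_segment("0000000100000E2200000005")  # from the ERROR line
oldest_copied = parse_segment("0000000100000E25000000F1")
newest_copied = parse_segment("0000000100000E2800000071")

# Tuples compare field by field, which orders segments on one timeline.
print(oldest_copied <= requested <= newest_copied)  # False
print(requested < oldest_copied)                    # True
```

So the requested segment is older than anything that was copied into the slave's pg_xlog.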

Does the slave really contain all the data? If not, how can I tell what
is missing (and why is it accepting streaming data if it's missing
something)?

One more piece of the puzzle that may or may not be relevant:

The current master used to be a streaming replication slave. The original
master had a disk failure and so we switched one of the backup slaves
into a master. We've replaced the disk on the original server and we're
now trying to make it a streaming replication slave. This is the part
that's failing. If I do rough estimates of how fast the Exx number is
incrementing and compute backwards, E22 seems like about the time of the
original disk failure, give or take.
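
The backwards estimate can be sketched as follows. It assumes the default 16 MB segment size and the 9.3 numbering in which each log number holds 256 segments (00 through FF). The WAL generation rate is not given in this message, so it is left as a parameter to be measured on the real master; the function names are made up for illustration.

```python
SEGMENTS_PER_LOG = 256   # assumption: 9.3+ numbering, segment FF is used
SEGMENT_MIB = 16         # assumption: default WAL segment size


def segment_index(name):
    """Linear position of a 24-hex-digit WAL file name on its timeline."""
    log_id = int(name[8:16], 16)
    segment = int(name[16:24], 16)
    return log_id * SEGMENTS_PER_LOG + segment


def segments_between(older, newer):
    """Number of segments from one WAL file name to a later one."""
    return segment_index(newer) - segment_index(older)


def hours_elapsed(segments, segments_per_hour):
    """Elapsed time for a given WAL rate (the rate must be measured)."""
    return segments / float(segments_per_hour)


# Requested segment (from the ERROR line) vs. backup start (backup_label)
gap = segments_between("0000000100000E2200000005",
                       "0000000100000E2700000031")
print(gap)                         # 1324
print(gap * SEGMENT_MIB / 1024.0)  # 20.6875 (GiB of WAL)
```

Dividing the gap by a measured segments-per-hour figure with hours_elapsed() then gives the rough age of the requested segment relative to the backup start.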

Thanks,
David
