From: Patrick Krecker <patrick(at)judicata(dot)com>
To: Andres Freund <andres(at)2ndquadrant(dot)com>, Chris Hundt <chris(at)judicata(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: WAL receive process dies
Date: 2014-08-29 22:37:51
Message-ID: CAK2mJFMz7ST1BBL4_CJnoiRkWdhppy5yDeDmS5SQTm+mjVHs3w@mail.gmail.com
Lists: pgsql-general

On Fri, Aug 29, 2014 at 2:11 PM, Andres Freund <andres(at)2ndquadrant(dot)com>
wrote:
> On 2014-08-29 13:04:43 -0700, Patrick Krecker wrote:
> > Hi Craig -- Sorry for the late response, I've been tied up with some
> other
> > things for the last day. Just to give some context, this is a machine
> that
> > sits in our office and replicates from another read slave in production
> via
> > a tunnel set up with spiped. The spiped tunnel is working and postgres is
> > still stuck (it has been stuck since 8-25).
> >
> > The last moment that replication was working was 2014-08-25
> > 22:06:05.03972. We have a table called replication_time with one column
> and
> > one row that has a timestamp that is updated every second, so it's easy
> to
> > tell the last time this machine was in sync with production.
> >
> > recovery.conf: http://pastie.org/private/dfmystgf0wxgtmahiita
> > logs: http://pastie.org/private/qt1ixycayvdsxafrzj0l0q
>
> The problem is this log entry:
>
> 2014-08-27 18:44:27 PDT ERROR: requested starting point 175/28000000 is
> ahead of the WAL flush position of this server 174/B76D16A8
>
> That's the walreceiver connecting to the upstream node and asking for
> WAL. Somehow the standby has gotten *considerably* ahead of the node
> it's trying to receive WAL from.
>
> Are you sure primary_conninfo / the spiped tunnel connects to the right
> postgres instance? Did you possibly have a failover or something like that?
>
> Greetings,
>
> Andres Freund
>
> --
> Andres Freund http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>
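To make the mismatch Andres describes concrete: a minimal sketch of the comparison, assuming psql access to both nodes (these are the same 9.x functions used in the queries below; the sketch itself is not from the original mails):

    -- on the stuck standby: where it last received/replayed WAL, i.e. roughly
    -- where its walreceiver will ask to resume streaming
    SELECT pg_last_xlog_receive_location(), pg_last_xlog_replay_location();

    -- on the upstream replica it streams from: how much WAL that node itself
    -- has received and flushed (it cannot serve WAL beyond this point)
    SELECT pg_last_xlog_receive_location();

    -- the error above means the standby's position (175/28000000) is ahead of
    -- the upstream's flush position (174/B76D16A8), so streaming cannot start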
We didn't have a failover. We just have one master and replica in different
AZs on Amazon EC2.
I ran the following on the local endpoint of spiped:
    while [ true ]; do
        psql -h localhost -p 5445 judicata -U marbury \
             -c "select modtime, pg_last_xlog_receive_location(), pg_last_xlog_replay_location() from replication_time"
    done
I ran the same command directly on production and verified that the xlog
positions for a given point in time were the same (modtime is updated every
second by an upstart job):
spiped from office -> production:

              modtime          | pg_last_xlog_receive_location | pg_last_xlog_replay_location
    ----------------------------+-------------------------------+------------------------------
     2014-08-29 15:23:25.563766 | 177/2E80C9F8                  | 177/2E80C9F8

Ran directly on production replica:

              modtime          | pg_last_xlog_receive_location | pg_last_xlog_replay_location
    ----------------------------+-------------------------------+------------------------------
     2014-08-29 15:23:25.563766 | 177/2E80C9F8                  | 177/2E80C9F8
To me, this is sufficient proof that spiped is indeed talking to the
machine I think it's talking to (also lsof reports the correct hostname).
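For reference, a minimal sketch of that kind of check on the office box (the exact lsof/ps invocations here are assumptions, not copied from my shell):

    # which process is listening on the local tunnel endpoint
    sudo lsof -iTCP:5445 -sTCP:LISTEN

    # spiped's command line shows the target (-t) it forwards to; it should be
    # the production replica's host:port
    ps -C spiped -o args=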
I created another base backup from the currently stuck postgres instance on
another machine, and I also get this error:
2014-08-29 15:27:30 PDT FATAL: could not receive data from WAL stream:
ERROR: requested starting point 177/2D000000 is ahead of the WAL flush
position of this server 174/B76D16A8
However, this new instance is able to fetch logs from S3 and replay them
without issue.
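For context, a recovery.conf shaped roughly like the following would explain that behavior: restore_command pulls archived segments from S3 independently of the streaming connection, so archive replay keeps working even when the walreceiver cannot start. This is only a sketch; it assumes WAL-E as the S3 fetch tool, and the host/port/user are taken from the psql command above, not from the real recovery.conf (which is only in the pastie link):

    standby_mode = 'on'
    # streaming goes through the local spiped endpoint
    primary_conninfo = 'host=localhost port=5445 user=marbury'
    # archived WAL comes from S3 (assuming WAL-E here)
    restore_command = 'wal-e wal-fetch "%f" "%p"'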
Is it possible that the data dir on the stuck instance is just corrupt? It's
not impossible that this happened at some point in the past due to the race
condition in fetching logs from S3 that I mentioned earlier.
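One low-risk way to check what the stuck instance itself believes about its WAL position is the control file; a sketch, run against the stuck instance's data directory (the PGDATA path and grep pattern are mine, not from the thread):

    # print the positions the cluster has recorded for itself
    pg_controldata "$PGDATA" | egrep -i 'checkpoint location|REDO location|Minimum recovery'

    # if these are ahead of what the upstream has flushed (174/B76D16A8), the
    # standby really did get ahead of its upstream, rather than the tunnel
    # pointing at the wrong server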