Re: WAL receive process dies

From: Patrick Krecker <patrick(at)judicata(dot)com>
To: Craig Ringer <craig(at)2ndquadrant(dot)com>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: WAL receive process dies
Date: 2014-08-29 20:04:43
Message-ID: CAK2mJFNK1aCNtGOqWAbf8fU3r-7y+yRa8VQ=9F=m0jx=d5CXeQ@mail.gmail.com
Lists: pgsql-general

Hi Craig -- Sorry for the late response; I've been tied up with other
things for the last day. For context, this is a machine that sits in our
office and replicates from another read slave in production via a tunnel
set up with spiped. The spiped tunnel is working, but postgres is still
stuck (it has been stuck since 8-25).

The last moment that replication was working was 2014-08-25
22:06:05.03972. We have a table called replication_time with one column and
one row that has a timestamp that is updated every second, so it's easy to
tell the last time this machine was in sync with production.
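
For illustration, the lag check against that table can be sketched roughly
as below. This is only a sketch: the function names and the 60-second
threshold are hypothetical, the timestamps are treated as naive values in
the same time zone (an assumption), and reading the row from
replication_time is left out.

```python
from datetime import datetime, timedelta


def replication_lag(last_heartbeat, now):
    """Return how far the replica is behind as a timedelta, never negative."""
    lag = now - last_heartbeat
    return max(lag, timedelta(0))


def is_stale(last_heartbeat, now, threshold=timedelta(seconds=60)):
    """True if the heartbeat is older than the (hypothetical) threshold."""
    return replication_lag(last_heartbeat, now) > threshold


# Last heartbeat seen on the stuck replica, compared with the time of this
# message (assumed here to be in the same time zone).
last_heartbeat = datetime(2014, 8, 25, 22, 6, 5, 39720)
now = datetime(2014, 8, 29, 20, 4, 43)
print(replication_lag(last_heartbeat, now), is_stale(last_heartbeat, now))
```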

recovery.conf: http://pastie.org/private/dfmystgf0wxgtmahiita
logs: http://pastie.org/private/qt1ixycayvdsxafrzj0l0q

Currently the WAL receive process is still not running. Interestingly,
another pg instance running on the same machine is replicating just fine.

A note about that second instance: there is a definite race condition with
restore_wal_s3.py, which writes the file to /tmp before copying it to the
destination requested by postgres (I just discovered this today; this is
not generally how we run our servers). So, if both instances are restoring
at the same time, they will step on each other's WAL archives being
unzipped in /tmp and bad things will happen. However, I checked the logs
for the other machine and there is no activity on that day. It does not
appear that the WAL replay was invoked or that the WAL receive timed out.
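
One way to avoid that kind of collision, sketched below, is to give each
invocation its own private staging directory instead of a shared path in
/tmp. This is not the actual restore_wal_s3.py: the function name
restore_segment is hypothetical, gzip compression is an assumption, and the
download from S3 is omitted.

```python
import gzip
import os
import shutil
import tempfile


def restore_segment(compressed_path, dest_path):
    """Decompress one archived WAL segment and copy it to dest_path.

    Each call stages the file in a freshly created directory, so two
    instances restoring at the same time never share a temporary file name.
    """
    with tempfile.TemporaryDirectory(prefix="wal-restore-") as workdir:
        staged = os.path.join(workdir, os.path.basename(dest_path))
        # Assumption: the archive is gzip-compressed.
        with gzip.open(compressed_path, "rb") as src, open(staged, "wb") as out:
            shutil.copyfileobj(src, out)
        # Copy to the destination postgres asked for; the staging directory
        # is removed automatically when the with-block exits.
        shutil.copyfile(staged, dest_path)
```

If decompression fails, the exception propagates before anything is written
to the destination, so a wrapper script can exit non-zero and the partial
file never reaches postgres.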

As for enabling core dumps, it seems that needs to be done when Postgres
starts, and I thought I would leave it running in its "stuck" state for
now. However, if you know how to enable it on a running process, let me
know. We are running Ubuntu 13.10.

On Wed, Aug 27, 2014 at 11:30 PM, Craig Ringer <craig(at)2ndquadrant(dot)com>
wrote:

> On 08/28/2014 09:39 AM, Patrick Krecker wrote:
> > We have a periodic network connectivity issue (unrelated to Postgres)
> > that is causing the replication to fail.
> >
> > We are running Postgres 9.3 using streaming replication. We also have
> > WAL archives available to be replayed with restore_command. Typically
> > when I bring up a slave it copies over WAL archives for a while before
> > connecting via streaming replication.
> >
> > When I notice the machine is behind in replication, I also notice that
> > the WAL receiver process has died. There didn't seem to be any
> > information in the logs about it.
>
> What did you search for?
>
> Do you have core dumps enabled? That'd be a good first step. (Exactly
> how to do this depends on the OS/distro/version, but you basically want
> to set "ulimit -c unlimited" on some ancestor of the postmaster).
>
> > 1. It seems that Postgres does not fall back to copying WAL archives
> > with its restore_command. I just want to confirm that this is what
> > Postgres is supposed to do when its connection via streaming replication
> > times out.
>
> It should fall back.
>
> > 2. Is it possible to restart replication after the WAL receiver process
> > has died without restarting Postgres?
>
> PostgreSQL should do so itself.
>
> Please show your recovery.conf (appropriately redacted) and
> postgresql.conf for the replica, and complete logs for the time period
> of interest. You'll want to upload the logs somewhere then link to them,
> do not attach them to an email to the list.
>
> --
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
>
