Re: WAL receive process dies

From: Patrick Krecker <patrick(at)judicata(dot)com>
To: Chris Hundt <chris(at)judicata(dot)com>
Cc: Andres Freund <andres(at)2ndquadrant(dot)com>, Craig Ringer <craig(at)2ndquadrant(dot)com>, pgsql-general(at)postgresql(dot)org
Subject: Re: WAL receive process dies
Date: 2014-09-01 23:53:03
Message-ID: CAK2mJFM-eeNJ3DFvprhkMhm+gT1fUU9Z4AxZsJ3eLRi8zJjVzQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On Sat, Aug 30, 2014 at 7:04 AM, Chris Hundt <chris(at)judicata(dot)com> wrote:

> (One of Patrick's coworkers here; thanks a lot for your assistance)
>
> Just in case you wanted this as well, I ran
>
> psql 'replication=1 dbname=XXX host=127.0.0.1 port=5432 user=XXX
> password=XXX' -c 'IDENTIFY_SYSTEM;'
>
> (5432 is the stuck replica and 5445 is the pipe to the working replica)
>
> systemid | timeline | xlogpos
> ---------------------+----------+--------------
> 5964163898407843711 | 1 | 174/B76D16A8
>
>
> On Fri, Aug 29, 2014 at 7:26 PM, Patrick Krecker <patrick(at)judicata(dot)com>
> wrote:
>
>>
>>
>>
>> On Fri, Aug 29, 2014 at 3:46 PM, Andres Freund <andres(at)2ndquadrant(dot)com>
>> wrote:
>>
>>> [FWIW: proper quoting makes answering easier and thus more likely]
>>>
>>> On 2014-08-29 15:37:51 -0700, Patrick Krecker wrote:
>>> > I ran the following on the local endpoint of spiped:
>>> >
>>> > while [ true ]; do psql -h localhost -p 5445 judicata -U marbury -c
>>> "select
>>> > modtime, pg_last_xlog_receive_location(),
>>> pg_last_xlog_replay_location()
>>> > from replication_time"; done;
>>> >
>>> > And the same command on production and I was able to verify that the
>>> xlogs
>>> > for a given point in time were the same (modtime is updated every
>>> second by
>>> > an upstart job):
>>> >
>>> > spiped from office -> production:
>>> > modtime | pg_last_xlog_receive_location |
>>> > pg_last_xlog_replay_location
>>> >
>>> ----------------------------+-------------------------------+------------------------------
>>> > 2014-08-29 15:23:25.563766 | 177/2E80C9F8 |
>>> 177/2E80C9F8
>>> >
>>> > Ran directly on production replica:
>>> > modtime | pg_last_xlog_receive_location |
>>> > pg_last_xlog_replay_location
>>> >
>>> ----------------------------+-------------------------------+------------------------------
>>> > 2014-08-29 15:23:25.563766 | 177/2E80C9F8 |
>>> 177/2E80C9F8
>>> >
>>> > To me, this is sufficient proof that spiped is indeed talking to the
>>> > machine I think it's talking to (also lsof reports the correct
>>> hostname).
>>> >
>>> > I created another basebackup from the currently stuck postgres intance
>>> on
>>> > another machine and I also get this error:
>>> >
>>> > 2014-08-29 15:27:30 PDT FATAL: could not receive data from WAL stream:
>>> > ERROR: requested starting point 177/2D000000 is ahead of the WAL flush
>>> > position of this server 174/B76D16A8
>>>
>>> Uh. this indicates that the machine you're talking to is *not* one of
>>> the above as it has a flush position of '174/B76D16A8' - not something
>>> that's really possible when the node actually is at '177/2E80C9F8'.
>>>
>>> Could you run, on the standby that's having problems, the following
>>> command:
>>> psql 'host=127.0.0.1 port=5445 user=XXX password=XXX' -c
>>> 'IDENTIFY_SYSTEM;'
>>>
>>> Greetings,
>>>
>>> Andres Freund
>>>
>>> --
>>> Andres Freund http://www.2ndQuadrant.com/
>>> PostgreSQL Development, 24x7 Support, Training & Services
>>>
>>
>> RE: quoting, I wonder if Gmail is messing it up somehow? Or am I doing
>> something else wrong? Sorry :(
>>
>> First, I apologize for the misleading information, but when I made
>> another basebackup and tried to use it, I configured the machine to cascade
>> from the stuck replica, *not* from the spiped endpoint. When I properly
>> connected it to the spiped endpoint it synced up fine, giving this log line:
>>
>> 2014-08-29 16:16:21 PDT LOG: started streaming WAL from primary at
>> 177/4F000000 on timeline 1
>>
>> The command as you gave reported a syntax error as is, but I googled a
>> little bit and run this one:
>>
>> psql 'replication=1 dbname=XXX host=127.0.0.1 port=5445 user=XXX
>> password=XXX' -c 'IDENTIFY_SYSTEM;'
>>
>> And it gave me this output:
>>
>> systemid | timeline | xlogpos
>> ---------------------+----------+--------------
>> 5964163898407843711 | 1 | 177/53091990
>>
>
>
I I restarted the server and it promptly started fetching WAL archives from
our backups [1]. I am still at a loss as to why the binary replication
couldn't reconnect and why Postgres didn't fall back to fetching WAL
archives with its restore_command.

[1] http://pastebin.com/ZxM1LsNw

In response to

Browse pgsql-general by date

  From Date Subject
Next Message Philip Rhoades 2014-09-02 03:05:08 Displaying a wide column with psql
Previous Message David G Johnston 2014-09-01 18:18:22 Re: copymanager question