From: | Thom Brown <thom(at)linux(dot)com> |
---|---|
To: | Robert Haas <robertmhaas(at)gmail(dot)com> |
Cc: | PostgreSQL-development <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Standby receiving part of missing WAL segment |
Date: | 2015-02-12 14:56:29 |
Message-ID: | CAA-aLv4LcGXy1KSV2JtN94-SPefzD4Dq_=J2zMOGNs3cmZRwfA@mail.gmail.com |
Lists: | pgsql-hackers |
On 12 February 2015 at 13:56, Robert Haas <robertmhaas(at)gmail(dot)com> wrote:
> On Wed, Feb 11, 2015 at 12:55 PM, Thom Brown <thom(at)linux(dot)com> wrote:
> > Today I witnessed a situation which appears to have gone down like this:
> >
> > - The primary server started streaming WAL data from segment 00A8 to the
> > standby
> > - The standby server started receiving that data
> > - Before 00A8 is finished, the wal sender process dies on the primary, but
> > the archiver process continues, and 00A8 ends up being archived as usual
> > - The primary continues to generate WAL and cleans up old WAL files from
> > pg_xlog until 00A8 is gone.
> > - The primary is restarted and the wal sender process is back up and running
> > - The standby says "waiting for 00A8", which it can no longer get from the
> > primary
> > - 00A8 is in the standby's archive directory, but the standby is waiting for
> > the rest of the segment from the primary via streaming replication, so
> > doesn't check the archive
> > - The standby is restarted
> > - The standby goes back into recovery and eventually replays 00A8 and
> > continues as normal.
> >
> > Should the standby be able to get feedback from the primary that the
> > requested segment is no longer available, and therefore know to check its
> > archive?
>
> Last time I played around with this, if the standby requested a
> segment from the master that was no longer present there, the standby
> would immediately get an ERROR, which it seems like would get you out
> of trouble. I wonder why that didn't happen in your case.
Yeah, I've tried recreating this like so:
- Primary streams to standby like usual
- Kill -9 primary then change its port and bring it back up
- Create traffic on primary until it no longer has the WAL file the
  standby wants, but has archived it
- Change the port of the primary back to what the standby is trying to
  talk to
But before it gets to that 4th point, the standby has gone to the archive
for the rest of it:
cp: cannot stat ‘/tmp/walarch/0000000100000006000000C8’: No such file or directory
2015-02-12 14:47:52 GMT [8280]: [1-1] user=,db=,client= FATAL: could not connect to the primary server: could not connect to server: Connection refused
    Is the server running on host "127.0.0.1" and accepting
    TCP/IP connections on port 5488?
cp: cannot stat ‘/tmp/walarch/0000000100000006000000C8’: No such file or directory
2015-02-12 14:47:57 GMT [8283]: [1-1] user=,db=,client= FATAL: could not connect to the primary server: could not connect to server: Connection refused
    Is the server running on host "127.0.0.1" and accepting
    TCP/IP connections on port 5488?
2015-02-12 14:48:02 GMT [8202]: [6-1] user=,db=,client= LOG: restored log file "0000000100000006000000C8" from archive
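The log above looks consistent with the standby alternating between its archive and the primary, retrying roughly every five seconds until one of them supplies the segment. Here is a minimal Python sketch of that loop, as a toy model only. The names `fetch_segment`, `try_archive`, `try_stream` and `retry_interval` are hypothetical and are not PostgreSQL internals; the archive-first ordering and the five-second wait are assumptions read off the timestamps in the log.

```python
from typing import Callable, List, Tuple

# Hypothetical probe signature: (segment_name, attempt_number) -> True if the
# source could supply the segment on that attempt.
Probe = Callable[[str, int], bool]


def fetch_segment(
    segment: str,
    try_archive: Probe,
    try_stream: Probe,
    max_attempts: int = 10,
    retry_interval: float = 5.0,
    sleep: Callable[[float], None] = lambda seconds: None,
) -> Tuple[str, List[str]]:
    """Alternate between archive and streaming until one supplies the segment.

    Returns the source that succeeded ("archive", "stream", or "gave_up")
    together with a list of log-like event strings. The default `sleep` is a
    no-op so the example runs instantly; pass time.sleep for real waits.
    """
    events: List[str] = []
    for attempt in range(1, max_attempts + 1):
        # Assumption: the archive is checked first on every round.
        if try_archive(segment, attempt):
            events.append(f"restored log file {segment} from archive")
            return "archive", events
        events.append(f"archive: {segment} not found")

        # Archive miss: fall back to streaming from the primary.
        if try_stream(segment, attempt):
            events.append(f"streaming {segment} from primary")
            return "stream", events
        events.append("stream: could not connect to the primary server")

        # Both sources failed; wait before the next round.
        sleep(retry_interval)
    return "gave_up", events


# Scenario modelled on the log: the primary is unreachable throughout, and the
# segment only shows up in the archive on the third attempt.
source, events = fetch_segment(
    "0000000100000006000000C8",
    try_archive=lambda seg, attempt: attempt >= 3,
    try_stream=lambda seg, attempt: False,
)
for line in events:
    print(line)
```

In this model the standby never gets stuck waiting on streaming alone, which matches what the recreation attempt showed but not the original incident.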
I don't suppose this is something that was buggy in 9.3.1?
Thom