From: | Jürgen Strobel <juergen+postgresql(at)strobel(dot)info> |
---|---|
To: | |
Cc: | PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org> |
Subject: | BUG #14321: pg_basebackup --xlog-method=stream fails |
Date: | 2016-09-10 00:10:45 |
Message-ID: | CALWJi_eA9X5K5z_OS58F_3j+WmQ0-UKKy+Z0e8qxXCcNkPDhjQ@mail.gmail.com |
Lists: | pgsql-bugs |
On 10 September 2016 at 00:09, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
wrote:
> On Sat, Sep 10, 2016 at 1:58 AM, <juergen+postgresql(at)strobel(dot)info> wrote:
> > The filesystem backup continues successfully to its end, but it concludes
> > without the necessary WAL files. I verified in pg_stat_replication that
> > pg_basebackup is not trying to reconnect to the master.
> >
> > I understand how to repair this manually and it's not an end-of-the-world
> > bug, but it would be nice if pg_basebackup would just reconnect the
> > streaming WAL connection in the same way as pg_receivexlog does. Especially
> > as that error happens in a long script run by cron and/or other people who
> > do not have this insight.
>
> Perhaps. The source server logs do prove the fact that pg_basebackup
> is requesting for missing WAL segments, right?
>
> > I haven't had time to try 9.6's --slot option yet, but I suspect this won't
> > be a full cure either unless it also changes the re-connect behavior.
>
> If what you are seeing missing are the first WAL segments that your
> backup needs, first the backup you took will be useless if you don't
> have a WAL archive from where recovery could fetch those missing
> segments. And in this case --slot will definitely help, but just be
> sure that this does not bloat your pg_xlog partition if disk space is
> a concern there.
> --
> Michael
>
First, I do have another WAL archive (usually).
But no, it is not the first segments that are missing: I only see the WAL
segments up to the point where the problem occurs, then nothing more.
The timeline as far as I can tell is:
1. pg_basebackup --xlog-method=stream starts and creates 2 connections for
backup and WAL streaming.
2. The VM's crappy I/O system hiccups and stalls the whole VM for a
surprisingly long time.
3. The server runs into wal_sender_timeout and closes the WAL streaming
connection.
4. pg_basebackup prints the warning and continues the filesystem copy,
*but makes no effort to re-open the WAL streaming connection*. With ps I
see a zombie child of the pg_basebackup process; I assume that's the one doing
the WAL streaming.
5. pg_basebackup finishes with the second half of pg_xlog missing, and the
DB fails to start.
In contrast, if the same problem occurs while running pg_receivexlog, it
waits for 5 seconds and then reopens the connection. I think that pg_basebackup
should show the same resilience.
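
To illustrate the behavior I am asking for, here is a minimal sketch of such a
retry loop. It is not taken from the pg_receivexlog or pg_basebackup sources.
The names stream_with_reconnect and stream_once are hypothetical, and
stream_once stands in for "open a WAL streaming connection and stream until
done", raising ConnectionError when the server drops the connection. The
5 second delay is the only value taken from what I observed, and the cap on
attempts is my own assumption.

```python
import time
from typing import Callable


def stream_with_reconnect(
    stream_once: Callable[[], None],
    retry_delay: float = 5.0,
    max_attempts: int = 10,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run stream_once() until it finishes without a dropped connection.

    stream_once is a hypothetical callable: it should return normally when
    streaming completed, and raise ConnectionError if the server closed the
    connection (for example after wal_sender_timeout).

    Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            stream_once()
            return attempt
        except ConnectionError:
            # Connection was dropped; wait before reconnecting, unless
            # this was already the last allowed attempt.
            if attempt < max_attempts:
                sleep(retry_delay)
    raise RuntimeError(
        f"WAL streaming still failing after {max_attempts} attempts"
    )
```

A real implementation would also have to resume streaming from the last
received WAL position instead of starting over. The sketch leaves that to
stream_once.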
-Jürgen