Re: BUG #14321: pg_basebackup --xlog-method=stream fails

From: Jürgen Strobel <juergen+postgresql(at)strobel(dot)info>
To: Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
Cc: PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #14321: pg_basebackup --xlog-method=stream fails
Date: 2016-09-10 19:28:08
Message-ID: CALWJi_e3F+KxJAaWe7AK9V12EWeLTcDEMkc-AK8=2Rd1n8-8fQ@mail.gmail.com
Lists: pgsql-bugs

On 10 September 2016 at 07:30, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
wrote:

> On Sat, Sep 10, 2016 at 9:10 AM, Jürgen Strobel
> <juergen+postgresql(at)strobel(dot)info> wrote:
> > First, I do have another WAL archive (usually).
> > But no, I only see the first WAL segments up to the point when the problem
> > occurs, then nothing more.
> >
> > The timeline as far as I can tell is:
> >
> > 1. pg_basebackup --xlog-method=stream starts and creates 2 connections for
> > backup and WAL streaming.
> > 2. The VM's crappy IO system hiccups and stalls the whole VM for a
> > surprisingly long time.
>
> I know that people can do fancy things here, believe me.
>
> > 3. The server runs into wal_sender_timeout and closes the WAL streaming
> > connection.
> > 4. pg_basebackup prints the warning, and continues the filesystem copy, *but
> > makes no effort to re-open the WAL streaming connection*. With ps I see a
> > zombie child of the pg_basebackup process; I assume that's the one doing the
> > WAL streaming.
> > 5. pg_basebackup finishes up with the second half of pg_xlog missing, and the
> > DB fails to start.
> >
> > In contrast, if the same problem occurs while running pg_receivexlog, it waits
> > for 5 seconds then reopens the connection. I think that pg_basebackup should
> > show the same resilience.
>
> You can blame your VM here to begin with :(
> Even with the default values of pg_basebackup --status-interval and
> wal_sender_timeout on the server there is enough margin to prevent
> things from getting killed, but if things get heavily constrained on I/O...
> Well, there is not much that any software could do... Now I agree that
> there would be room for improvement to make pg_basebackup retry a
> stream instead of failing, and that may be something that people would
> be willing to have. But it's hard to think about improvements in
> this area as anything other than a new feature, rather than a bug fix.
>
> Anyway, replication slots would not help here if you just rely on
> pg_basebackup to finish the job.
> --
> Michael
>

I do agree the VM is bad, but I have to work with what I have now.

I do not agree it's a pure feature request though. When this problem
happens pg_basebackup should either abort fully with a suitable error, or
retry streaming WAL until it has everything it needs for a functional
backup (or streaming fails due to WAL cleanup on the server). The current
behavior of finishing the filesystem backup with a mere warning is
inconsistent and not user friendly. If I use --xlog-method=stream I expect
to end up with all WAL in the end or to get a clear error. It took me quite
some time to figure out what was happening. And of course this never happened
in QA/staging systems, only in production.
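
For illustration, the "clear error" behavior asked for above can be
approximated outside pg_basebackup with a small wrapper. This is only a
sketch under stated assumptions, not how pg_basebackup works: the names
BackupFailed, check_result and run_strict are made up for this example, and
it assumes that any output on stderr, even with exit status 0, should fail
the backup. The exact warning text is not quoted in this thread, so the
sketch deliberately does not match on it.

```python
import subprocess


class BackupFailed(RuntimeError):
    """Raised when the backup command exits non-zero or writes to stderr."""


def check_result(returncode, stderr_text):
    """Decide whether a finished backup run counts as a success."""
    message = stderr_text.strip()
    # A non-zero exit status is always a failure.
    if returncode != 0:
        raise BackupFailed("exit status %d: %s" % (returncode, message))
    # Assumption of this sketch: anything on stderr is suspicious,
    # even when the exit status is 0.
    if message:
        raise BackupFailed("exit status 0 but stderr not empty: %s" % message)


def run_strict(cmd):
    """Run cmd (an argument list) and raise BackupFailed on any problem."""
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    check_result(proc.returncode, proc.stderr)
```

It would be called with the full argument list of the backup command, for
example run_strict(["pg_basebackup", "--xlog-method=stream", ...]) with the
remaining options filled in. Such a strict check also rejects harmless
notices, and it is probably unusable together with options that report
progress on stderr, so treat it as a stopgap only.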

I understand that this may not affect many people, and that it's not going
to get immediate attention; classify it as you wish.

The replication slot feature might make it easier for me to recover from
the problem using pg_receivexlog afterwards.

-Jürgen
