From: | Jürgen Strobel <juergen+postgresql(at)strobel(dot)info> |
---|---|
To: | Michael Paquier <michael(dot)paquier(at)gmail(dot)com> |
Cc: | PostgreSQL Bugs <pgsql-bugs(at)postgresql(dot)org> |
Subject: | Re: BUG #14321: pg_basebackup --xlog-method=stream fails |
Date: | 2016-09-10 19:28:08 |
Message-ID: | CALWJi_e3F+KxJAaWe7AK9V12EWeLTcDEMkc-AK8=2Rd1n8-8fQ@mail.gmail.com |
Lists: | pgsql-bugs |
On 10 September 2016 at 07:30, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>
wrote:
> On Sat, Sep 10, 2016 at 9:10 AM, Jürgen Strobel
> <juergen+postgresql(at)strobel(dot)info> wrote:
> > First, I do have another WAL archive (usually).
> > But no, I only see the first WAL segments up to the point when the problem
> > occurs, then nothing more.
> >
> > The timeline as far as I can tell is:
> >
> > 1. pg_basebackup --xlog-method=stream starts and creates 2 connections for
> > backup and WAL streaming.
> > 2. The VM's crappy IO system hiccups and stalls the whole VM for a
> > surprisingly long time.
>
> I know that people can do fancy things here, believe me.
>
> > 3. The server runs into wal_sender_timeout and closes the WAL streaming
> > connection.
> > 4. pg_basebackup prints the warning, and continues the filesystem copy, *but
> > makes no effort to re-open the WAL streaming connection*. With ps I see a
> > zombie child of the pg_basebackup process; I assume that's the one doing the
> > WAL streaming.
> > 5. pg_basebackup finishes up with the second half of pg_xlog missing, and the
> > DB fails to start.
> >
> > In contrast, if the same problem occurs while running pg_receivexlog it waits
> > for 5 seconds then reopens the connection. I think that pg_basebackup should
> > show the same resilience.
>
> You can blame your VM here to begin with :(
> Even with the default values of pg_basebackup --status-interval and
> wal_sender_timeout on the server there is enough margin to prevent
> things from getting killed, but if things get heavily constrained on I/O...
> Well, there is not much that any software could do... Now I agree that
> there would be room for improvement to make pg_basebackup retry a
> stream instead of failing, and that may be something that people would
> be willing to have. But it's hard to think about improvements in
> this area as something other than a new feature, and not a bug.
>
> Anyway, replication slots would not help here if you just rely on
> pg_basebackup to finish the job.
> --
> Michael
>
I do agree the VM is bad, but I have to work with what I've got now.
I do not agree it's a pure feature request though. When this problem
happens pg_basebackup should either abort fully with a suitable error, or
retry streaming WAL until it has everything it needs for a functional
backup (or streaming fails due to WAL cleanup on the server). The current
behavior of finishing the filesystem backup with a mere warning is
inconsistent and not user-friendly. If I use --xlog-method=stream I expect
to end up with all WAL in the end or to get a clear error. It took me quite
some time to figure out what's happening. And of course this never happened
in QA/staging systems, only in production.
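
To make the "retry or abort with a clear error" behavior concrete, here is a minimal Python sketch of the policy I have in mind. It is not pg_basebackup code: `stream_once`, `IncompleteWalError` and the attempt limit are hypothetical names and values chosen for illustration, and the 5 second delay just mirrors what pg_receivexlog does when it loses its connection.

```python
import time
from typing import Callable, Optional


class IncompleteWalError(RuntimeError):
    """Raised when the required WAL could not be streamed in full."""


def stream_wal_with_retry(
    stream_once: Callable[[], bool],
    max_attempts: int = 5,
    retry_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call stream_once() until it reports that all required WAL arrived.

    stream_once is a hypothetical callable: it opens a streaming
    connection, receives WAL, and returns True once everything needed is
    on disk. It returns False or raises ConnectionError when the
    connection is lost early (for example after wal_sender_timeout).

    Returns the number of attempts used. Raises IncompleteWalError when
    the attempts are exhausted, so the caller never ends up with a
    backup that looks finished but lacks WAL.
    """
    last_error: Optional[BaseException] = None
    for attempt in range(1, max_attempts + 1):
        try:
            if stream_once():
                return attempt
        except ConnectionError as exc:
            last_error = exc
        if attempt < max_attempts:
            # Wait before reconnecting instead of hammering the server.
            sleep(retry_delay)
    raise IncompleteWalError(
        f"WAL streaming incomplete after {max_attempts} attempts"
    ) from last_error
```

The point of the sketch is only that there are two acceptable outcomes: the function returns because the WAL is complete, or it raises. A warning followed by a normal exit is neither.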
I understand that this may not affect many people, and that it's not going
to get immediate attention; classify it as you wish.
The replication slot feature might make it easier for me to recover from
the problem using pg_receivexlog afterwards.
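
For that kind of recovery I first need to know which segments are missing from the backup's pg_xlog. A rough Python sketch of that check follows. It assumes the default 16 MB segment size, the usual 24-hex-digit WAL file names, and that the first and last required segment names are already known from elsewhere; the function names are made up for this example, and timeline switches are not handled.

```python
from typing import Iterable, List, Tuple

# With 16 MB segments (assumed default) there are 256 segments per "log" id.
SEGMENTS_PER_XLOG_ID = 0x100000000 // (16 * 1024 * 1024)


def parse_segment(name: str) -> Tuple[int, int]:
    """Split a 24-hex-digit WAL file name into (timeline, segment number)."""
    if len(name) != 24:
        raise ValueError(f"not a WAL segment file name: {name!r}")
    timeline = int(name[0:8], 16)
    log_id = int(name[8:16], 16)
    seg = int(name[16:24], 16)
    return timeline, log_id * SEGMENTS_PER_XLOG_ID + seg


def format_segment(timeline: int, segno: int) -> str:
    """Build the WAL file name for a timeline and segment number."""
    return "%08X%08X%08X" % (
        timeline,
        segno // SEGMENTS_PER_XLOG_ID,
        segno % SEGMENTS_PER_XLOG_ID,
    )


def missing_segments(present: Iterable[str], first: str, last: str) -> List[str]:
    """List segments in [first, last] that are absent from `present`."""
    timeline, start = parse_segment(first)
    last_timeline, end = parse_segment(last)
    if timeline != last_timeline:
        raise ValueError("timeline switches are not handled in this sketch")
    have = set(present)
    wanted = (format_segment(timeline, n) for n in range(start, end + 1))
    return [name for name in wanted if name not in have]
```

An empty result means the range is complete; anything else is the list of files that still has to be fetched before the backup can start.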
-Jürgen