Re: pg_basebackup connection closed unexpectedly...

From: Mladen Marinović <mladen(dot)marinovic(at)kset(dot)org>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: pg_basebackup connection closed unexpectedly...
Date: 2020-02-13 09:41:32
Message-ID: CAHjkqPRwHXw56GYAPO65ZTQsdFXG7YQ2nJX2LFOhyq-A3eF9wg@mail.gmail.com
Lists: pgsql-general

On Wed, Feb 12, 2020 at 4:09 PM Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> wrote:

> Mladen Marinović <mladen(dot)marinovic(at)kset(dot)org> writes:
> > Recently I have been having some strange problems with pg_basebackup.
> > About once a week the backup process ends with an error message like this:
> > 2020-02-11 23:25:40 UTC [25790]: [1-1] user=replicator,db=[unknown] LOG:
> > could not send data to client: Connection reset by peer
>
> Hmmm ....
>
> > The problem started occurring after a hardware (RAM + SSD) upgrade and an
> > OS Upgrade to Ubuntu 18.04. Both the server and backup process run in
> > separate docker containers on the same machine. This happens randomly on
> > multiple servers with the same configuration and it is probably not
> > hardware related. Also, this happens equally on 9.4 and 9.6, using the
> > same docker images that worked flawlessly on the previous installation.
> > I have been investigating the issue for at least a month and have found no
> > problems in any log or metric before or after the event. I suspect that
> > this is related to some OS/docker parameter that is misconfigured.
>
> How long does the backup run before failing? If the connection were going
> between different machines my suspicions would lean toward a network
> timeout. That seems somewhat unlikely in this configuration, but you
> never know.
>

The backup started at 23:00, and it had copied 363GB by the time the
connection was closed. It usually takes about 2 hours for the entire
database (approx. 1.1TB). I was also thinking that the problem could be
network related, but the network is a virtual docker bridge network on a
single machine, and the backup usually succeeds. If it failed during other
operations (as this is a production database) or during every backup, it
would be easier to see what the problem could be, but this is really
annoyingly random.
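A quick back-of-the-envelope check on these figures, as a sketch only: it assumes the failed run started at exactly 23:00, that the server's LOG timestamp (23:25:40) marks roughly when copying stopped, and decimal units (1 TB = 1000 GB).

```python
from datetime import datetime

# Figures from this thread. Assumptions: the start is exactly 23:00 and the
# server's LOG timestamp approximates when the copy stopped.
start = datetime(2020, 2, 11, 23, 0, 0)
failure = datetime(2020, 2, 11, 23, 25, 40)
copied_gb = 363           # copied before the connection was reset
full_gb = 1100            # ~1.1 TB, assuming decimal units (1 TB = 1000 GB)
full_seconds = 2 * 3600   # a normal full backup takes about 2 hours

elapsed = (failure - start).total_seconds()
rate_failed = copied_gb * 1000 / elapsed       # MB/s during the failed run
rate_typical = full_gb * 1000 / full_seconds   # MB/s averaged over a normal run

print(f"failed run: {elapsed:.0f} s, {rate_failed:.1f} MB/s")
print(f"typical run: {rate_typical:.1f} MB/s")
```

Under those assumptions the failed run averaged about 236 MB/s against roughly 153 MB/s for a typical full backup. That is consistent with the copy running at full speed for most of the window, though it cannot rule out a short stall just before the reset.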

>
> > Would increasing the database log level give me any more info about what
> > caused the connection to close?
>
> Nope, not directly. It might be useful to figure out whether data
> transfer continues full throttle right up until the connection drop,
> or whether it stops sooner (and then there's some sort of timeout
> before the error occurs).
>
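One way to act on this suggestion, sketched in Python: record timestamped samples of the transferred-kB counter (for example by timestamping pg_basebackup --progress output as it arrives) and measure how long the counter had stopped growing before the failure. The progress-line format matched below is an assumption based on typical output and should be checked against the installed version; the function names are hypothetical.

```python
import re
from typing import Optional, Sequence, Tuple

# Assumed shape of a pg_basebackup --progress line, e.g.
# "380634112/1153433600 kB (33%), 0/1 tablespace". Verify against your version.
PROGRESS_RE = re.compile(r"(\d+)/(\d+) kB")


def parse_done_kb(line: str) -> Optional[int]:
    """Return the kB-transferred counter from a progress line, or None."""
    m = PROGRESS_RE.search(line)
    return int(m.group(1)) if m else None


def stall_before_failure(samples: Sequence[Tuple[float, int]],
                         failure_time: float) -> float:
    """Seconds between the last sample whose counter grew and the failure.

    samples: (unix_time, kb_done) pairs in chronological order.
    """
    if not samples:
        raise ValueError("need at least one sample")
    last_growth = samples[0][0]  # treat the first sample as the baseline
    for (_, prev_kb), (t, kb) in zip(samples, samples[1:]):
        if kb > prev_kb:
            last_growth = t
    return failure_time - last_growth


# Illustrative samples: the counter stops growing at t=20, failure at t=45.
example = [(0.0, 0), (10.0, 2000), (20.0, 4000), (30.0, 4000), (40.0, 4000)]
print(stall_before_failure(example, 45.0))
```

A result close to zero would mean data was flowing right up to the drop; a large value would point to a stall followed by some timeout.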

I can see that pg_basebackup has a verbose switch, but I am not sure it
will report the things you mention. On the database server, the log levels
currently are:
client_min_messages = notice
log_min_messages = warning
log_min_error_statement = error

I assume that I should change the first two to at least debug1 to see
something.
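A simplified model of how log_min_messages filters the server log, as a hedged sketch based on the severity order in the PostgreSQL documentation (the helper names are illustrative, not part of PostgreSQL). For this setting LOG ranks between ERROR and FATAL, which is why a LOG line like the one quoted above is written even with log_min_messages = warning. client_min_messages, by contrast, only controls what is sent to the connected client, so raising it does not add detail to the server log.

```python
# Severity order used by log_min_messages for the server log, lowest first,
# per the PostgreSQL documentation. Note that LOG ranks above ERROR here.
SERVER_LOG_ORDER = ["DEBUG5", "DEBUG4", "DEBUG3", "DEBUG2", "DEBUG1",
                    "INFO", "NOTICE", "WARNING", "ERROR", "LOG", "FATAL", "PANIC"]
RANK = {name: i for i, name in enumerate(SERVER_LOG_ORDER)}


def reaches_server_log(message_level: str, log_min_messages: str) -> bool:
    """Simplified model: is a message at message_level written to the server log?"""
    return RANK[message_level.upper()] >= RANK[log_min_messages.upper()]


for setting in ("warning", "debug1"):
    shown = [lvl for lvl in SERVER_LOG_ORDER if reaches_server_log(lvl, setting)]
    print(f"log_min_messages = {setting}: {', '.join(shown)}")
```

Under this model, moving log_min_messages from warning to debug1 adds DEBUG1, INFO and NOTICE messages on top of what is logged now.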

> regards, tom lane
>

Regards,
Mladen Marinović
