Re: Basebackup fails without useful error message

From: Koen De Groote <kdg(dot)dev(at)gmail(dot)com>
To: Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
Cc: PostgreSQL General <pgsql-general(at)lists(dot)postgresql(dot)org>
Subject: Re: Basebackup fails without useful error message
Date: 2024-10-20 21:03:51
Message-ID: CAGbX52ENsSHKoTyu5+XfN1o1bZ2w2CJaE1oQnxcm=fj2SyoZXg@mail.gmail.com
Lists: pgsql-general

Hello Adrian, and everyone else.

It has finally happened: the backup ran into an error again, and the
verbose output set me on the right path.

I'm getting this error message:

> pg_basebackup: could not receive data from WAL stream: server closed the connection unexpectedly
> This probably means the server terminated abnormally
> before or while processing the request.

Combined with the main server logging:

> terminating walsender process due to replication timeout

Now, the server is set up with an archive_command which gzips the WAL files
and writes them to a network filesystem.
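
To make that setup concrete, here is a minimal sketch of what such an archive
step could look like. It is not the actual command in use: the function name
archive_wal and the /mnt/wal_archive path are hypothetical placeholders, and it
assumes the server calls it with %p (path of the WAL file) and %f (file name).

```shell
# Hypothetical stand-in for the gzip-to-network-storage archive step.
# A wrapper script around it could be referenced from postgresql.conf as:
#   archive_command = '/path/to/wrapper %p %f'
archive_wal() {
    src="$1"                            # %p: path of the WAL file to archive
    name="$2"                           # %f: file name only
    dest_dir="${3:-/mnt/wal_archive}"   # placeholder for the network mount
    dest="$dest_dir/$name.gz"

    # Refuse to overwrite an existing archive file; a nonzero exit status
    # tells the server the attempt failed, and it will retry later.
    [ ! -e "$dest" ] || return 1

    # Compress to a temporary name first, then rename, so a half-written
    # file never appears under its final name.
    gzip -c "$src" > "$dest.tmp" && mv "$dest.tmp" "$dest"
}
```

If the write to the network filesystem stalls, a command like this simply
takes longer to return, which is the situation described next.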

From looking at machine metrics at the time, my conclusion is the following:

At the time of the error, the remote filesystem experienced a very high
queue size for new writes.

So I'm assuming the process of writing WAL files, if there is an
archive_command set, is only considered to be finished after the archive is
written, not just when the WAL file is written in pg_wal.

I'm also seeing in the documentation that the default WAL method for
pg_basebackup is "stream", which waits for these WAL files as they are
produced.

I suspect that I have 2 possible paths at this point:

1: increase wal_sender_timeout
2: run the basebackup with --wal-method=none since my restore_command is
set up to explicitly go to the very same network storage to get the
archived WAL files.
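
As a rough sketch, the two options could be tried as follows. The commands are
only printed unless RUN=1 is set. The '5min' value and the "postgres"
superuser name are assumptions for illustration; the pg_basebackup arguments
are the ones from the command quoted further down, plus --verbose.

```shell
#!/bin/sh
# Dry-run helper: prints each command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Option 1: raise wal_sender_timeout (the default is 60s), then reload.
# '5min' is an arbitrary example value, not a recommendation.
run psql -h localhost -p 5432 -U postgres \
    -c "ALTER SYSTEM SET wal_sender_timeout = '5min'"
run psql -h localhost -p 5432 -U postgres -c "SELECT pg_reload_conf()"

# Option 2: take the same backup without streaming WAL alongside it.
run pg_basebackup -h localhost -p 5432 -U basebackup_user \
    -D /mnt/base_backup/dir -Ft -z -P --verbose --wal-method=none
```

With --wal-method=none the backup contains no WAL at all, so it can only be
restored if every WAL file generated during the backup is available from the
archive.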

I'm going to be testing this. If someone could confirm that this is how
writing WAL files works, namely that a WAL file is only considered "done"
once the archive_command has finished, that would be great.

Regards,
Koen De Groote

On Sun, Sep 29, 2024 at 6:08 PM Adrian Klaver <adrian(dot)klaver(at)aklaver(dot)com>
wrote:

> On 9/29/24 08:57, Koen De Groote wrote:
> > > What is the complete command you are using?
> >
> > The full command is:
> >
> > pg_basebackup -h localhost -p 5432 -U basebackup_user -D /mnt/base_backup/dir -Ft -z -P
> >
> > So output Format as tar, gzipped, and with progress being printed.
> >
> > > Have you looked at the Postgres log?
> >
> > > Is --verbose being used?
> >
> > This is straight from the logs, it's the only output besides the %
> > progress counter.
> >
> > Will have a look at --verbose.
>
> When you report back on that, and if it still does not show the error, then what are the:
>
> Postgres version.
>
> OS and version.
>
> Anything special about the cluster like tablespaces, extensions,
> replication, etc.
>
>
> >
> > Regards,
> > Koen De Groote
> >
>
> --
> Adrian Klaver
> adrian(dot)klaver(at)aklaver(dot)com
>
>
