Re: Reliable WAL file shipping over unreliable network

From: Laurenz Albe <laurenz(dot)albe(at)cybertec(dot)at>
To: Nagy László Zsolt <gandalf(at)shopzeus(dot)com>, pgsql-admin(at)lists(dot)postgresql(dot)org
Subject: Re: Reliable WAL file shipping over unreliable network
Date: 2018-02-28 21:16:56
Message-ID: 1519852616.13006.7.camel@cybertec.at
Lists: pgsql-admin

Nagy László Zsolt wrote:
> > > Do I have to copy
> > > segments to temp files, and rename them when they are fully flushed to
> > > disk? Or is it okay to have half complete files in the archive dir for a
> > > while?
> >
> > I suppose you are talking about "archive_command" here.
> >
> > If the file restored with "restore_command" is too small,
> > the operation fails, and you get a DEBUG1 message:
> >
> > archive file "..." has wrong size: ... instead of ...
> >
> > So nothing can go wrong there.
>
> Nothing can go wrong? Does it mean that PostgreSQL will re-execute the
> restore_command if the file was too small? If it won't retry the
> restore_command, then everything goes wrong. It might be documented
> somewhere, but apparently I'm out of luck with documentations.

To verify that a file restored by restore_command is rejected when it is
too short, read the code in backend/access/transam/xlogarchive.c.
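
As a rough illustration of that check (a Python sketch of the idea only, not
the actual C code in xlogarchive.c), the restored file's size is compared with
the expected segment size and the file is rejected on a mismatch. The function
name and the constant are made up for this example; 16 MiB is the default WAL
segment size and is assumed here.

```python
import os

# Assumption: default WAL segment size of 16 MiB.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def restored_file_is_complete(path: str, expected_size: int = WAL_SEGMENT_SIZE) -> bool:
    """Return True only if the restored file exists and has the expected size.

    Hypothetical helper that mirrors the idea of the size check; a partial
    file is reported and treated as a failed restore.
    """
    try:
        actual_size = os.stat(path).st_size
    except FileNotFoundError:
        return False
    if actual_size != expected_size:
        # Same wording as the DEBUG1 message quoted above.
        print(f'archive file "{path}" has wrong size: '
              f'{actual_size} instead of {expected_size}')
        return False
    return True
```

A half-copied segment therefore counts as "restore failed", and the standby
simply tries again later.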

The behavior for streaming replication with a WAL archive is documented here:
https://www.postgresql.org/docs/current/static/warm-standby.html#STANDBY-SERVER-OPERATION

"At startup, the standby begins by restoring all WAL available in the archive
location, calling restore_command. Once it reaches the end of WAL available
there and restore_command fails, it tries to restore any WAL available in the
pg_wal directory. If that fails, and streaming replication has been configured,
the standby tries to connect to the primary server and start streaming WAL
from the last valid record found in archive or pg_wal. If that fails or
streaming replication is not configured, or if the connection is later
disconnected, the standby goes back to step 1 and tries to restore the file
from the archive again. This loop of retries from the archive, pg_wal, and
via streaming replication goes on until the server is stopped or failover
is triggered by a trigger file."
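
The loop described in that paragraph can be sketched roughly as follows. This
is a toy simulation, not PostgreSQL code: the three callables stand in for
"restore_command succeeded", "found a usable segment in pg_wal" and "streaming
connection worked", and max_rounds replaces "until the server is stopped or
failover is triggered". All names are hypothetical.

```python
from typing import Callable, Dict, List


def recovery_loop(fetch: Dict[str, Callable[[], bool]], max_rounds: int) -> List[str]:
    """Simulate the standby's retry loop and return a log of the attempts.

    fetch maps "archive", "pg_wal" and "stream" to callables that return
    True when that source delivered WAL and False when it failed.
    """
    log: List[str] = []
    for _ in range(max_rounds):
        # Step 1: restore from the archive until restore_command fails.
        while fetch["archive"]():
            log.append("archive: ok")
        log.append("archive: fail")
        # Step 2: replay whatever is available in pg_wal.
        while fetch["pg_wal"]():
            log.append("pg_wal: ok")
        log.append("pg_wal: fail")
        # Step 3: try streaming; on failure or disconnect go back to step 1.
        if fetch["stream"]():
            log.append("stream: ok")
        else:
            log.append("stream: fail")
    return log


if __name__ == "__main__":
    archive_results = iter([True, True, False, False])
    sources = {
        "archive": lambda: next(archive_results),
        "pg_wal": lambda: False,
        "stream": lambda: False,
    }
    for entry in recovery_loop(sources, max_rounds=2):
        print(entry)
```

The point is that a failed restore_command is not fatal in standby mode; it
only moves the standby on to the next source, and the archive is tried again
on the next round.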

> > > And finally: if I also enable streaming replication, then it seems that
> > > log file shipping is not needed at all. If I omit archive_command and
> > > restore_command from the configs, and setup the replication slots and
> > > primary_conninfo only, then it seems to be working just fine. But when
> > > the network goes down for a while, then the slave goes out of sync and
> > > it cannot recover. It was not clear for me from the documentation, but
> > > am I right in that I can combine log file shipping with streaming
> > > replication, and achieve small replication delays plus the ability to
> > > recover after a longer period of network outage?
> >
> > If you use a replication slot, the standby will never get out of sync
> > because the primary will retain all WAL that the standby has not
> > received yet.
> >
> > Streaming replication together with archive recovery is only useful
> > if you are *not* using replication slots.
>
> So you are saying that if I use replication slots, then I can completely
> forget about manual WAL file shipping.

Precisely.

> There is one thing in the docs
> that contradicts the above statement.
>
> This is from
> https://www.postgresql.org/docs/10/static/warm-standby.html#STREAMING-REPLICATION
>
> > If you use streaming replication without file-based continuous
> > archiving, the server might recycle old WAL segments before the
> > standby has received them. If this occurs, the standby will need to be
> > reinitialized from a new base backup.

This only talks about the case where you do not use replication slots.

Read https://www.postgresql.org/docs/current/static/warm-standby.html#STREAMING-REPLICATION-SLOTS:

"Replication slots provide an automated way to ensure that the master does not
remove WAL segments until they have been received by all standbys, and that
the master does not remove rows which could cause a recovery conflict even
when the standby is disconnected."

The key word is "ensure".
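
Since the primary keeps everything the slot still needs, it can be useful to
see how much WAL that is while the standby is disconnected. Below is a small
Python sketch, assuming you have already fetched two LSN strings yourself, for
example from pg_current_wal_lsn() and from the restart_lsn column of
pg_replication_slots. The function names are invented for this example.

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert an LSN in the textual 'X/Y' form (two hex numbers) to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL between the slot's restart_lsn and the current write position."""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(restart_lsn)


if __name__ == "__main__":
    # Example values, not taken from a real server.
    print(retained_wal_bytes("0/5000000", "0/3000000"))
```

With the example values the difference is 0x2000000 bytes, that is 32 MiB or
two segments at the default segment size.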

Yours,
Laurenz Albe
