Standby Mechanics: WAL vs Streaming

From: Don Seiler <don(at)seiler(dot)us>
To: pgsql-admin <pgsql-admin(at)postgresql(dot)org>
Subject: Standby Mechanics: WAL vs Streaming
Date: 2017-08-26 22:58:02
Message-ID: CAHJZqBA35ZnPc0pnxC_XoZUxoXNeBjBWMW0cNtKv9zut+LoJ=A@mail.gmail.com
Lists: pgsql-admin

Primary is 9.2.18, standby is 9.2.10. I know, I know. We're upgrading both
to 9.2.22 in the next maintenance window. Standby is configured for
streaming replication with the primary_conninfo parameter set. The primary
has a bash script for archive_command that copies the WAL file to NFS and
then rsyncs the file to the replicas.
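For reference, a minimal sketch of what such an archive_command wrapper might look like. The function name, paths, and variables are assumptions, not our actual script, and in production the second copy would be an rsync over ssh to the standby; a plain cp keeps the sketch self-contained:

```shell
# Sketch of an archive_command wrapper. Postgres would invoke it as:
#   archive_command = 'archive_wal.sh %p %f'
# where %p is the path to the WAL segment and %f its bare file name.
archive_wal() {
    wal_path="$1"   # %p: path to the finished 16MB segment
    wal_name="$2"   # %f: segment name, e.g. 0000000100000000000000A1
    # 1. Copy to the shared NFS archive; the command must exit 0 only
    #    once the copy is durable, or Postgres will consider the
    #    segment archived when it isn't.
    cp "$wal_path" "$NFS_DIR/$wal_name" || return 1
    # 2. Push a copy toward the standby. Production would be something
    #    like: rsync -a "$NFS_DIR/$wal_name" standby:/pg/wal_archive/
    cp "$NFS_DIR/$wal_name" "$STANDBY_DIR/$wal_name" || return 1
}
```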

After some index rebuilds and a manual vacuum today, the standby started
lagging by a couple of hours. After some time I noticed that the WAL files
being rsynced were still way behind the point the standby had recovered to.
For example, pg_last_xact_replay_timestamp would show 2PM while the WAL
files being rsynced hadn't even made it past noon yet.

That's when I remembered that this is streaming replication, so the
standby is getting the xlog records directly from the primary. The archived
WAL files themselves aren't used at all by the standby, but the standby was
cleaning them up, presumably because it saw that the data in them had been
applied hours ago.
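On 9.2 this is easy to confirm from the catalogs: the primary's pg_stat_replication view shows the walsender's state and the positions the standby has confirmed, and the standby can report how far replay has progressed in time. Both the view and the function exist in 9.2; the queries below are just one way to check:

```sql
-- On the primary: one row per connected standby.
SELECT client_addr, state, sent_location, replay_location
FROM pg_stat_replication;

-- On the standby: wall-clock replay delay.
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;
```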

So I'm writing just to get a confirmation or correction of my understanding
of the process here, as well as some suggestions. Here's my first swing:

1. Transaction completes on primary, xlog record is created. When the
standby needs it, it pulls it from the primary (using the primary_conninfo
credentials).
2. When a WAL file is archived, our own script runs an rsync command to
push it to the standby.
3. The standby will recover from the streaming xlog data if local WAL
files are not available. (So the standby prefers local WAL files to
streaming?)
4. If WAL files are available in an archive dir, restore_command is used
to restore those files to the pg_xlog directory and the archive file
removed.
5. When a WAL file is no longer needed for recovery, the standby will
delete them from the pg_xlog directory.
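If my understanding above is right, the standby's recovery.conf for this kind of setup would look something like the following. The host name, user, and archive path are illustrative, not our actual values:

```
standby_mode = 'on'
primary_conninfo = 'host=primary.example.com port=5432 user=replicator'
# pull archived segments from the NFS archive when streaming can't supply them
restore_command = 'cp /mnt/nfs/wal_archive/%f "%p"'
# let the standby prune archive files it no longer needs (step 5 above)
archive_cleanup_command = 'pg_archivecleanup /mnt/nfs/wal_archive %r'
```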

Assuming I'm right so far, I have these questions:

1. I saw that the WAL archives were being removed from the archive
directory. If the standby was already ahead of the WAL files because of
streaming replication, would it still use the restore_command to restore
WAL files from archive to the pg_xlog directory? Or would it just delete the
archive files?
2. As far as the lag goes, what can we do to increase the throughput for
streaming replication? We tried to give it almost all of our 100Mb pipe
between data centers and there was still significant lag until it recovered
past the point of my manual vacuums.
3. The archive_command on the primary seemed to run serially and block,
since it took 4-5 seconds to rsync each 16MB WAL file from primary to
standby. Is there any way to parallelize this? I guess we could just have
archive_command do a local copy and then have a separate script in cron to
handle some multiple rsync calls (with a max concurrency, of course)?
4. I'm going to assume that having both streaming replication and WAL
rsyncing are going to compete for bandwidth. Would it be advisable to only
do streaming most of the day and then maybe rsync WAL files in a nightly
batch? Now that I think of it, we do hourly snapmirrors of that WAL storage
already (also using the same bandwidth), perhaps we just use that for the
WAL replication.
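To make question 3 concrete, here's a rough sketch of the split I have in mind: archive_command does only a cheap local copy into a staging directory, and a cron job ships whatever has accumulated there with bounded concurrency via xargs -P. The function name, paths, and concurrency limit are all hypothetical, and the cp stands in for the rsync-to-standby so the sketch is self-contained:

```shell
# Hypothetical cron-driven shipper: move every staged WAL segment to its
# destination, running up to 4 transfers at once.
ship_wal() {
    stage_dir="$1"   # where archive_command drops its local copies
    dest_dir="$2"    # stand-in for standby:/path (rsync target in production)
    # WAL segment names contain no spaces, so ls | xargs is safe here.
    ls "$stage_dir" | xargs -I{} -P4 sh -c \
        "cp '$stage_dir/{}' '$dest_dir/{}' && rm '$stage_dir/{}'"
}
```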

I'd love to know how others might be handling this. If there's any
important information that I've left out, please let me know.

Don.
--
Don Seiler
www.seiler.us
