Streaming replication

From: Mike Howland <jkashoto(at)gmail(dot)com>
To: pgsql-novice(at)lists(dot)postgresql(dot)org
Subject: Streaming replication
Date: 2017-12-03 02:29:50
Message-ID: CAGAqBGL2CysUbrKhoPWBMppmRPvVV+NUQNjnikr2o5qbGP8Etg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-novice

Hello all,

I have a question that perhaps somebody can help me with. I'm using
streaming replication with 9.2 (I know it's quite old now, we plan to move
to 9.6 shortly but I'd like to get these older machines working if I can).

I have two machines where the secondary is configured as a backup of the
primary machine. I used the instructions here:
https://wiki.postgresql.org/wiki/Streaming_Replication to establish
streaming replication... BTW, I'm using the 7.1 path (using
backup/rsync/!backup) to prime the files on the secondary machine.

Most of the time it comes up, connects and is quite happy. For my own
sanity I have a bit of monitoring code that occasionally injects a change
on the primary and watches for it to arrive on the secondary to ensure data
is flowing. I also have some code that checks to see how close things are
using pg_current_xlog_location() and pg_last_xlog_receive_location(). When
things are working, these checks are always fine.

Once in a while, this nonce value does not seem to be updated on the
secondary server (although I can see it change on the primary). During this
period of time, my watcher code also detects
that pg_last_xlog_receive_location() is ahead of
pg_current_xlog_location(). From what I can tell, the secondary has
connected to the primary, it's just data does not seem to be flowing. I'll
attach a bit of detal below.

I'm sure I must be doing something wrong when I prime the secondary
database. It works most of the time but that's simply not good enough.
I'd buy a virtual coffee for anybody that can show me the error of my ways
here...
Any ideas?

A bit of output/background that may be useful:

Output from watch dog code output (running on the secondary). It queries
the local (secondary) cluster and also connects to the primary cluster to
insert a nonce and gather some information.
The nonce is just a counting int. When the secondary is started, it starts
counting at 0 and increases every few minutes. As you can see, the data in
the secondary is still at 18 (from a previous switch).
This is very odd because the primary cluster was rsync'ed shortly before
this.

last xlog receive 0/5000000
current xlog location: 0/41A3008
WARNING: Secondary appears ahead of primary:
WARNING!! nonce match comparison failed: 2 != 18
dog: nonce did not compare, but it's not been very long yet.

The sequence I'm using to prime the secondary:

0) Request that the secondary be shutdown and internally backed up.

1) Without backup being set (just to move the bulk of the files)
rsync -a -v --rsh ssh -o StrictHostKeyChecking=no -o BatchMode=yes --delete
--exclude pg_log --exclude recovery.conf --exclude backup_label --exclude
pg_hba.conf --exclude postmaster.pid --exclude postmaster.opts --exclude
postgresql.conf /db/pg/general 77.66.27.224:/db/pg

2) Enter backup mode
SELECT pg_start_backup('sync-backup', true);

3) Rsync again, hopefully quicker than #1. I wanted to minimize the time
it spent in backup.
rsync -a -v --rsh ssh -o StrictHostKeyChecking=no -o BatchMode=yes --delete
--exclude pg_log --exclude recovery.conf --exclude backup_label --exclude
pg_hba.conf --exclude postmaster.pid --exclude postmaster.opts --exclude
postgresql.conf /db/pg/general 77.66.27.224:/db/pg

4) Exit backup mode
SELECT pg_stop_backup();

5) Rsync pg_xlog files
rsync -a -v --rsh ssh -o StrictHostKeyChecking=no -o BatchMode=yes --delete
/db/pg/general/pg_xlog 77.66.27.224:/db/pg/general

6) Rsync archive files to secondary
rsync -a -v --rsh ssh -o StrictHostKeyChecking=no -o BatchMode=yes --delete
--exclude syncInProgress --exclude archiveScript /db/pg/general.archive
77.66.27.224:/db/pg

7) Tell the secondary server to start up

Output from postgres on the secondary server when it starts. I've tried
looking into the cannot stat issues without much success. Note that in
situations where it does work, it looks just like this as well (complete
with the zero length warning, etc).

LOG: database system was shut down in recovery at 2017-11-29 06:43:13 GMT
cp: cannot stat '{$0}{$8}/db/pg/general.archive/00000002.history'{$0}{$9}:
No such file or directory
LOG: entering standby mode
cp: cannot stat
'{$0}{$8}/pg/general.archive/000000010000000000000005'{$0}{$9}: No such
file or directory
LOG: consistent recovery state reached at 0/5000080
LOG: record with zero length at 0/5000080
LOG: database system is ready to accept read only connections
cp: cannot stat
'{$0}{$8}/db/pg/general.archive/000000010000000000000005'{$0}{$9}: No such
file or directory
cp: cannot stat '{$0}{$8}/db/pg/general.archive/00000002.history'{$0}{$9}:
No such file or directory
LOG: streaming replication successfully connected to primary

Any thoughts at all are appreciated!
Thanks,

-mike

Responses

Browse pgsql-novice by date

  From Date Subject
Next Message Laurenz Albe 2017-12-04 07:16:02 Re: Streaming replication
Previous Message Julia B Pinchak 2017-12-01 01:25:05 Connecting to Postgresql 10 port 5433 server on Stack Builder 4.1.0