Correct setup for 3+ node cluster with streaming replication and WAL archiving

From: Andre Piwoni <apiwoni(at)webmd(dot)net>
To: pgsql-general(at)lists(dot)postgresql(dot)org
Subject: Correct setup for 3+ node cluster with streaming replication and WAL archiving
Date: 2019-02-14 23:04:45
Message-ID: CAEC-+VE1wD2t9x3Lz3hjvh9shv+Osveo4M-P0L_92OGyi96eOg@mail.gmail.com

I'm having issues with replication on a slave node after shutting down the
primary node and an apparently successful promotion of another slave to master.
I have set up streaming replication with WAL archiving, and it works until
failover is triggered: the remaining slave, after shutdown and repointing to
the new primary, is unable to restore from the archive.
This results in growing replication lag.

A notable thing about this setup is that the archive location stays the same.
That is, it contains WAL files from the old primary and, after failover, from
the new primary. I'm not sure this is a correct setup.
The old primary wrote WAL file 000000010000000000000006 during shutdown, but
the new primary wants to write 000000010000000000000005, which already exists
in the archive, so archiving fails. Furthermore, the slave repointed to the new
master tries to restore 000000010000000000000007, which does not exist in the
WAL archive directory.
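
For context, my reading of the segment names: the first eight hex digits are
the timeline ID, so after promotion the archive ends up holding files from both
timelines. The listing below is only what I would expect to see in
/mnt/pgsql/archive, not an actual capture from my cluster:

# Illustration only -- expected archive contents after failover, inferred
# from the segment names in the logs below (not a real directory listing).
$ ls /mnt/pgsql/archive
000000010000000000000001 ... 000000010000000000000006   # timeline 1 (old primary)
00000002.history                                         # timeline history written at promotion
000000020000000000000007 ...                             # timeline 2 (new primary)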

It seems a lot of the scripts PgPool uses to repoint other slaves to the new
master after failover simply take a fresh pg_basebackup, which I think is a
ridiculous waste of time and resources for larger databases, but I'm wondering
if this is the only way to recover slaves. Can anyone tell me what I am doing
wrong in this particular setup?
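
For reference, this is roughly what those scripts do, as I understand them
(my paraphrase, not code from any particular script; the data directory path
is just the default on my machines):

# Re-clone the slave from the new master with pg_basebackup (the step I'd
# like to avoid); -R writes a recovery.conf pointing at the new master.
pg_ctl stop -D /var/lib/pgsql/10/data -m fast
rm -rf /var/lib/pgsql/10/data/*
pg_basebackup -h pg-hdp-node3.kitchen.local -U replication \
    -D /var/lib/pgsql/10/data -X stream -R
pg_ctl start -D /var/lib/pgsql/10/data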

Each node is configured as follows:
archive_mode = on
archive_command = 'test ! -f /mnt/pgsql/archive/%f && cp %p /mnt/pgsql/archive/%f'
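
As I understand it, the "test ! -f" guard makes the whole command exit non-zero
when the file is already present in the archive, which is exactly what the new
primary runs into in the logs below. A quick illustration (my own example, not
taken from the logs):

# If the segment already exists in the archive, the guard fails the command,
# so the server reports "archive command failed" and retries the same segment.
test ! -f /mnt/pgsql/archive/000000010000000000000005 && \
    cp pg_wal/000000010000000000000005 /mnt/pgsql/archive/000000010000000000000005
echo $?    # prints 1 when the file is already there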

Each slave node is configured as follows:
standby_mode = 'on'
restore_command = 'cp /mnt/pgsql/archive/%f %p'
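
For completeness, the full recovery.conf on each slave looks roughly like this
(primary_conninfo trimmed; the full string is in the DEBUG output further down,
and the host is whatever the current master is):

standby_mode = 'on'
primary_conninfo = 'user=replication password=password host=pg-hdp-node3.kitchen.local port=5432 sslmode=prefer'
restore_command = 'cp /mnt/pgsql/archive/%f %p'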

I'm running PostgreSQL 10.5 with PgPool, which triggers failover and repoints
the slave(s) to the new master.

The sequence of events is as follows (the commands behind each step are
sketched after the list):
1) Shutdown primary
2) Promote eligible slave to new master with WAL archiving on
3) Shutdown other slave(s)
4) Repoint other slave(s) to new master
5) Start other slave(s)
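
Roughly, these steps correspond to the commands below (a sketch of what my
PgPool failover/follow scripts end up running; the data directory path is just
the default on my machines):

# 1) Stop the primary (or it has already gone away)
pg_ctl stop -D /var/lib/pgsql/10/data -m fast
# 2) Promote the eligible slave; it switches to a new timeline
pg_ctl promote -D /var/lib/pgsql/10/data
# 3) Stop the other slave(s)
pg_ctl stop -D /var/lib/pgsql/10/data -m fast
# 4) Repoint them: edit recovery.conf so primary_conninfo points at the new
#    master (host=pg-hdp-node3.kitchen.local in this run)
# 5) Start them again
pg_ctl start -D /var/lib/pgsql/10/data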

Some relevant log lines:

OLD PRIMARY:
2019-02-14 11:13:04.072 PST [4720] FATAL: the database system is shutting down
2019-02-14 11:13:04.105 PST [4209] DEBUG: write 0/7000098 flush 0/7000098 apply 0/7000098
2019-02-14 11:13:04.116 PST [4721] FATAL: the database system is shutting down
2019-02-14 11:13:04.119 PST [32702] DEBUG: server process (PID 4721) exited with exit code 1
2019-02-14 11:13:04.172 PST [32711] DEBUG: archived write-ahead log file "000000010000000000000006"

PROMOTED SLAVE:

2019-02-14 11:13:04.933 PST [1578] CONTEXT: WAL redo at 0/7000028 for XLOG/CHECKPOINT_SHUTDOWN: redo 0/7000028; tli 1; prev tli 1; fpw true; xid 0:300730; oid 17950; multi 1; offset 0; oldest xid 548 in DB 1; oldest multi 1 in DB 1; oldest/newest commit timestamp xid: 0/0; oldest running xid 0; shutdown
2019-02-14 11:13:04.933 PST [1582] DEBUG: sending write 0/7000098 flush 0/7000098 apply 0/7000098
2019-02-14 11:13:04.937 PST [1582] LOG: replication terminated by primary server
2019-02-14 11:13:04.937 PST [1582] DETAIL: End of WAL reached on timeline 1 at 0/7000098.
2019-02-14 11:13:04.937 PST [1582] FATAL: could not send end-of-streaming message to primary: no COPY in progress
2019-02-14 11:13:04.939 PST [1578] DEBUG: switched WAL source from stream to archive after failure
2019-02-14 11:13:04.939 PST [1578] LOG: invalid record length at 0/7000098: wanted 24, got 0
2019-02-14 11:13:09.943 PST [2335] FATAL: could not connect to the primary server: could not connect to server: Connection refused
        Is the server running on host "pg-hdp-node1.kitchen.local" (172.28.128.54) and accepting
        TCP/IP connections on port 5432?
2019-02-14 11:13:11.988 PST [1579] DEBUG: performing replication slot checkpoint
2019-02-14 11:13:11.993 PST [1579] DEBUG: attempting to remove WAL segments older than log file 000000000000000000000005

2019-02-14 11:13:19.119 PST [1578] LOG: received promote request
2019-02-14 11:13:19.119 PST [1578] LOG: redo done at 0/7000028
2019-02-14 11:13:19.119 PST [1578] DEBUG: resetting unlogged relations: cleanup 0 init 1
2019-02-14 11:13:19.121 PST [1578] LOG: selected new timeline ID: 2
2019-02-14 11:13:19.173 PST [1578] DEBUG: could not remove file "pg_wal/000000020000000000000007": No such file or directory
2019-02-14 11:13:19.175 PST [1578] LOG: archive recovery complete
2019-02-14 11:13:19.178 PST [1578] DEBUG: attempting to remove WAL segments newer than log file 000000020000000000000007
2019-02-14 11:13:19.180 PST [1578] DEBUG: MultiXactId wrap limit is 2147483648, limited by database with OID 1
2019-02-14 11:13:19.180 PST [1578] DEBUG: MultiXact member stop limit is now 4294914944 based on MultiXact 1
2019-02-14 11:13:19.180 PST [1578] DEBUG: release all standby locks
2019-02-14 11:13:19.182 PST [1574] DEBUG: starting background worker process "logical replication launcher"
2019-02-14 11:13:19.183 PST [1574] LOG: database system is ready to accept connections
2019-02-14 11:13:19.198 PST [2346] DETAIL: The failed archive command was: test ! -f /mnt/pgsql/archive/000000010000000000000005 && cp pg_wal/000000010000000000000005 /mnt/pgsql/archive/000000010000000000000005
2019-02-14 11:13:19.199 PST [1579] DEBUG: attempting to remove WAL segments older than log file 000000000000000000000006
2019-02-14 11:13:19.200 PST [1579] DEBUG: recycled write-ahead log file "000000010000000000000006"
2019-02-14 11:13:20.201 PST [2346] LOG: archive command failed with exit code 1

OTHER SLAVE:
2019-02-14 11:13:14.723 PST [5807] FATAL: could not connect to the primary server: could not connect to server: Connection refused
        Is the server running on host "pg-hdp-node1.kitchen.local" (172.28.128.54) and accepting
        TCP/IP connections on port 5432?
2019-02-14 11:13:16.072 PST [1839] DEBUG: performing replication slot checkpoint
2019-02-14 11:13:16.077 PST [1839] DEBUG: attempting to remove WAL segments older than log file 000000000000000000000005
2019-02-14 11:13:16.078 PST [1839] DEBUG: recycled write-ahead log file "000000010000000000000005"
2019-02-14 11:13:19.412 PST [1838] DEBUG: switched WAL source from stream to archive after failure
2019-02-14 11:13:19.412 PST [1838] DEBUG: invalid resource manager ID 64 at 0/7000098
2019-02-14 11:13:19.412 PST [1838] DEBUG: switched WAL source from archive to stream after failure
2019-02-14 11:13:24.418 PST [1839] LOG: shutting down
2019-02-14 11:13:24.427 PST [1834] LOG: database system is shut down
2019-02-14 11:13:25.096 PST [5839] LOG: database system was shut down in recovery at 2019-02-14 11:13:24 PST
2019-02-14 11:13:25.096 PST [5839] DEBUG: standby_mode = 'on'
2019-02-14 11:13:25.096 PST [5839] DEBUG: primary_conninfo = 'user=replication password=password host=pg-hdp-node3.kitchen.local port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres target_session_attrs=any'
2019-02-14 11:13:25.096 PST [5839] DEBUG: restore_command = 'cp /mnt/pgsql/archive/%f %p'
2019-02-14 11:13:25.096 PST [5839] LOG: entering standby mode
cp: cannot stat ‘/mnt/pgsql/archive/000000010000000000000007’: No such file or directory
2019-02-14 11:13:25.102 PST [5839] DEBUG: could not restore file "000000010000000000000007" from archive: child process exited with exit code 1
2019-02-14 11:13:25.149 PST [5839] DEBUG: switched WAL source from stream to archive after failure
cp: cannot stat ‘/mnt/pgsql/archive/000000010000000000000007’: No such file or directory
2019-02-14 11:13:25.154 PST [5839] DEBUG: could not restore file "000000010000000000000007" from archive: child process exited with exit code 1
2019-02-14 11:13:25.154 PST [5839] DEBUG: invalid resource manager ID 64 at 0/7000098

Andre Piwoni
