| From: | PG Bug reporting form <noreply(at)postgresql(dot)org> | 
|---|---|
| To: | pgsql-bugs(at)lists(dot)postgresql(dot)org | 
| Cc: | mb(at)dataegret(dot)com | 
| Subject: | BUG #15151: Error with wal replay after planned manual switchover. | 
| Date: | 2018-04-12 12:20:09 | 
| Message-ID: | 152353560977.31233.11406243320876395727@wrigleys.postgresql.org | 
| Views: | Whole Thread | Raw Message | Download mbox | Resend email | 
| Thread: | |
| Lists: | pgsql-bugs | 
The following bug has been logged on the website:
Bug reference:      15151
Logged by:          Maxim Boguk
Email address:      mb(at)dataegret(dot)com
PostgreSQL version: 9.6.8
Operating system:   Linux
Description:        
Hi,
Today after planned manual switch-over (via remove all load from master,
checkpoint master, wait replica in sync, stop master, stop replica, switch
master and replica roles without promote), I got very strange error on new
replica (log_min_messages=debug5):
2018-04-12 04:23:28 MSK 32059 @ from  [vxid: txid:0] [] LOG:  database
system was shut down in recovery at 2018-04-12 04:23:26 MSK
2018-04-12 04:23:28 MSK 32059 @ from  [vxid: txid:0] [] LOG:  entering
standby mode
2018-04-12 04:23:28 MSK 32059 @ from  [vxid: txid:0] [] LOG:  invalid
resource manager ID 211 at C7BB/EDC42C68
2018-04-12 04:23:28 MSK 32060 rohh(dot)app(at)hh from 192.168.1.2 [vxid: txid:0] []
FATAL:  the database system is starting up
2018-04-12 04:23:28 MSK 32061 rohh(dot)app(at)hh from [local] [vxid: txid:0] []
FATAL:  the database system is starting up
2018-04-12 04:23:28 MSK 32063 [unknown](at)[unknown] from [local] [vxid:
txid:0] [] LOG:  incomplete startup packet
2018-04-12 04:23:28 MSK 32064 postgres(at)postgres from [local] [vxid: txid:0]
[] FATAL:  the database system is starting up
2018-04-12 04:23:28 MSK 32062 @ from  [vxid: txid:0] [] LOG:  started
streaming WAL from primary at C7BB/ED000000 on timeline 1
2018-04-12 04:23:28 MSK 32059 @ from  [vxid: txid:0] [] LOG:  invalid
resource manager ID 211 at C7BB/EDC42C68
2018-04-12 04:23:30 MSK 32062 @ from  [vxid: txid:0] [] FATAL:  terminating
walreceiver process due to administrator command
2018-04-12 04:23:30 MSK 32059 @ from  [vxid: txid:0] [] LOG:  invalid
resource manager ID 211 at C7BB/EDC42C68
2018-04-12 04:23:30 MSK 32059 @ from  [vxid: txid:0] [] LOG:  invalid
resource manager ID 211 at C7BB/EDC42C68
and so on.
Start with log_min_messages=debug5 - doesn't add any useful information.
Restart of new slave doesn't help.
Stop new slave, manual clean xlog directory on slave and start slave -
doesn't help (e.g. it doesn't look like the single wal file was damaged
during network transfer).
Server offline at this moment and open for experiments for few days.
pg_xlogdump show no errors in wal files:
postgres(at)db-hh3:~$ /usr/lib/postgresql/9.6/bin/pg_xlogdump
--start=C7BB/EDC42C68
/var/lib/postgresql/9.6/main/pg_xlog/000000010000C7BB000000ED | less
first record is after C7BB/EDC42C68, at C7BB/EDC433B8, skipping over 1872
bytes
rmgr: XLOG        len (rec/tot):     51/  2556, tx:          0, lsn:
C7BB/EDC433B8, prev C7BB/EDC42928, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326257 FPW
rmgr: XLOG        len (rec/tot):     51/  3140, tx:          0, lsn:
C7BB/EDC43DB8, prev C7BB/EDC433B8, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/150046445 blk 32094 FPW
rmgr: XLOG        len (rec/tot):     51/  2595, tx:          0, lsn:
C7BB/EDC44A18, prev C7BB/EDC43DB8, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326250 FPW
rmgr: XLOG        len (rec/tot):     51/  3130, tx:          0, lsn:
C7BB/EDC45440, prev C7BB/EDC44A18, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/150046445 blk 15989 FPW
rmgr: XLOG        len (rec/tot):     51/  2590, tx:          0, lsn:
C7BB/EDC46098, prev C7BB/EDC45440, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326298 FPW
Few records before C7BB/EDC42C68:
rmgr: XLOG        len (rec/tot):     51/  2520, tx:          0, lsn:
C7BB/EDC41F38, prev C7BB/EDC41348, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326155 FPW
rmgr: XLOG        len (rec/tot):     51/  2703, tx:          0, lsn:
C7BB/EDC42928, prev C7BB/EDC41F38, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326262 FPW
rmgr: XLOG        len (rec/tot):     51/  2556, tx:          0, lsn:
C7BB/EDC433B8, prev C7BB/EDC42928, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326257 FPW
rmgr: XLOG        len (rec/tot):     51/  3140, tx:          0, lsn:
C7BB/EDC43DB8, prev C7BB/EDC433B8, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/150046445 blk 32094 FPW
rmgr: XLOG        len (rec/tot):     51/  2595, tx:          0, lsn:
C7BB/EDC44A18, prev C7BB/EDC43DB8, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326250 FPW
pg_xlogdump show no visible to me anomalies in 000000010000C7BB000000ED or
000000010000C7BB000000EC files.
Any additional information I'll try provide, but my experience with internal
wal structure outside of using pg_xlogdump output very limited.
| From | Date | Subject | |
|---|---|---|---|
| Next Message | Tom Lane | 2018-04-12 22:14:03 | Re: BUG #15150: Reading uninitialised value in NISortAffixes (tsearch/spell.c) | 
| Previous Message | Arthur Zakirov | 2018-04-12 11:56:39 | Re: BUG #15150: Reading uninitialised value in NISortAffixes (tsearch/spell.c) |