Quick Links

BUG #15151: Error with wal replay after planned manual switchover.

From:	PG Bug reporting form <noreply(at)postgresql(dot)org>
To:	pgsql-bugs(at)lists(dot)postgresql(dot)org
Cc:	mb(at)dataegret(dot)com
Subject:	BUG #15151: Error with wal replay after planned manual switchover.
Date:	2018-04-12 12:20:09
Message-ID:	152353560977.31233.11406243320876395727@wrigleys.postgresql.org
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-bugs

The following bug has been logged on the website:

Bug reference: 15151
Logged by: Maxim Boguk
Email address: mb(at)dataegret(dot)com
PostgreSQL version: 9.6.8
Operating system: Linux
Description:

Hi,

Today after planned manual switch-over (via remove all load from master,
checkpoint master, wait replica in sync, stop master, stop replica, switch
master and replica roles without promote), I got very strange error on new
replica (log_min_messages=debug5):

2018-04-12 04:23:28 MSK 32059 @ from [vxid: txid:0] [] LOG: database
system was shut down in recovery at 2018-04-12 04:23:26 MSK
2018-04-12 04:23:28 MSK 32059 @ from [vxid: txid:0] [] LOG: entering
standby mode
2018-04-12 04:23:28 MSK 32059 @ from [vxid: txid:0] [] LOG: invalid
resource manager ID 211 at C7BB/EDC42C68
2018-04-12 04:23:28 MSK 32060 rohh(dot)app(at)hh from 192.168.1.2 [vxid: txid:0] []
FATAL: the database system is starting up
2018-04-12 04:23:28 MSK 32061 rohh(dot)app(at)hh from [local] [vxid: txid:0] []
FATAL: the database system is starting up
2018-04-12 04:23:28 MSK 32063 [unknown](at)[unknown] from [local] [vxid:
txid:0] [] LOG: incomplete startup packet
2018-04-12 04:23:28 MSK 32064 postgres(at)postgres from [local] [vxid: txid:0]
[] FATAL: the database system is starting up
2018-04-12 04:23:28 MSK 32062 @ from [vxid: txid:0] [] LOG: started
streaming WAL from primary at C7BB/ED000000 on timeline 1
2018-04-12 04:23:28 MSK 32059 @ from [vxid: txid:0] [] LOG: invalid
resource manager ID 211 at C7BB/EDC42C68
2018-04-12 04:23:30 MSK 32062 @ from [vxid: txid:0] [] FATAL: terminating
walreceiver process due to administrator command
2018-04-12 04:23:30 MSK 32059 @ from [vxid: txid:0] [] LOG: invalid
resource manager ID 211 at C7BB/EDC42C68
2018-04-12 04:23:30 MSK 32059 @ from [vxid: txid:0] [] LOG: invalid
resource manager ID 211 at C7BB/EDC42C68

and so on.
Start with log_min_messages=debug5 - doesn't add any useful information.

Restart of new slave doesn't help.
Stop new slave, manual clean xlog directory on slave and start slave -
doesn't help (e.g. it doesn't look like the single wal file was damaged
during network transfer).

Server offline at this moment and open for experiments for few days.

pg_xlogdump show no errors in wal files:
postgres(at)db-hh3:~$ /usr/lib/postgresql/9.6/bin/pg_xlogdump
--start=C7BB/EDC42C68
/var/lib/postgresql/9.6/main/pg_xlog/000000010000C7BB000000ED | less
first record is after C7BB/EDC42C68, at C7BB/EDC433B8, skipping over 1872
bytes
rmgr: XLOG len (rec/tot): 51/ 2556, tx: 0, lsn:
C7BB/EDC433B8, prev C7BB/EDC42928, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326257 FPW
rmgr: XLOG len (rec/tot): 51/ 3140, tx: 0, lsn:
C7BB/EDC43DB8, prev C7BB/EDC433B8, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/150046445 blk 32094 FPW
rmgr: XLOG len (rec/tot): 51/ 2595, tx: 0, lsn:
C7BB/EDC44A18, prev C7BB/EDC43DB8, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326250 FPW
rmgr: XLOG len (rec/tot): 51/ 3130, tx: 0, lsn:
C7BB/EDC45440, prev C7BB/EDC44A18, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/150046445 blk 15989 FPW
rmgr: XLOG len (rec/tot): 51/ 2590, tx: 0, lsn:
C7BB/EDC46098, prev C7BB/EDC45440, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326298 FPW

Few records before C7BB/EDC42C68:

rmgr: XLOG len (rec/tot): 51/ 2520, tx: 0, lsn:
C7BB/EDC41F38, prev C7BB/EDC41348, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326155 FPW
rmgr: XLOG len (rec/tot): 51/ 2703, tx: 0, lsn:
C7BB/EDC42928, prev C7BB/EDC41F38, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326262 FPW

rmgr: XLOG len (rec/tot): 51/ 2556, tx: 0, lsn:
C7BB/EDC433B8, prev C7BB/EDC42928, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326257 FPW
rmgr: XLOG len (rec/tot): 51/ 3140, tx: 0, lsn:
C7BB/EDC43DB8, prev C7BB/EDC433B8, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/150046445 blk 32094 FPW
rmgr: XLOG len (rec/tot): 51/ 2595, tx: 0, lsn:
C7BB/EDC44A18, prev C7BB/EDC43DB8, desc: FPI_FOR_HINT , blkref #0: rel
1663/16400/18105 blk 326250 FPW

pg_xlogdump show no visible to me anomalies in 000000010000C7BB000000ED or
000000010000C7BB000000EC files.

Any additional information I'll try provide, but my experience with internal
wal structure outside of using pg_xlogdump output very limited.

Responses

Re: BUG #15151: Error with wal replay after planned manual switchover. at 2018-04-13 01:10:46 from Michael Paquier

Browse pgsql-bugs by date

	From	Date	Subject
Next Message	Tom Lane	2018-04-12 22:14:03	Re: BUG #15150: Reading uninitialised value in NISortAffixes (tsearch/spell.c)
Previous Message	Arthur Zakirov	2018-04-12 11:56:39	Re: BUG #15150: Reading uninitialised value in NISortAffixes (tsearch/spell.c)