standby fails with out-of-order XID insertion upon start

From: Radoslav Nedyalkov <rnedyalkov(at)gmail(dot)com>
To: pgsql-bugs(at)lists(dot)postgresql(dot)org
Subject: standby fails with out-of-order XID insertion upon start
Date: 2021-04-02 13:51:05
Message-ID: CANhtRibYPVfH65phBtVFQGL1FvYHEk30G_0rNLbpLy_Qa4K8Zg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi Team,

We were building a standby (pg11.11) as log-shipping and then switched to
stream.
However after changing the recovery.conf and restarting the server it
failed with

2021-03-31 14:23:04 UTC [84999] 0: [27-1] user=,db= host=,app=[] FATAL:
out-of-order XID insertion in KnownAssignedXids
2021-03-31 14:23:04 UTC [84999] 0: [28-1] user=,db= host=,app=[] CONTEXT:
WAL redo at A8EA2/3A900150 for Standby/RUNNING_XACTS: nextXid 55152875
latestCompletedXid 55152874 oldestRunningXid 4260572256; 13 xacts: 53056024
53575669 41821378 53575647 41706387 53575629 4260572256 51340860 54917157
44843675 44370470 27311547 55152872

More details about the setup:
Basebackup was build from another standby (pg11.11) Amazon
Linux 5.4.95-42.163.amzn2.x86_64
Wal logs and streaming were from master (pg11.8)
Centos7 3.10.0-1127.10.1.el7.x86_64
Standby was hot_standby=on. It was restarted when it was already
operational and up-to-date
with the master.
The basebackup took 10 hours to build and then another 7 to catch up with
the wal apply.
Master was running some long running transactions. One of these -
4260572256 was lasting before and after and I guess a wraparound happened
during the build.(excuse my hint here).

After the failure we started the standby with hot_standby = off for a
minute and it applied the "offending" wal. We put it 'on' and it failed
again with the same error. We left it running 'off' for a couple of hours
in a hope out-of-order txs will be gone. And it worked. We switched it 'on'
and it became a fully functional standby which we promoted as master and
made a couple of tests.

Attaching a more detailed log.
Here is the relevant section from the wal file.

rmgr: Btree len (rec/tot): 53/ 5525, tx: 55152872, lsn:
A8EA2/3A8FEA00, prev A8EA2/3A8FE118, desc: INSERT_LEAF off 171, blkref #0:
rel 1663/16401
/30619182 blk 16810 FPW
rmgr: Btree len (rec/tot): 64/ 64, tx: 55152872, lsn:
A8EA2/3A8FFF98, prev A8EA2/3A8FEA00, desc: INSERT_LEAF off 198, blkref #0:
rel 1663/16401
/30619190 blk 1627541
rmgr: Standby len (rec/tot): 345/ 345, tx: 0, lsn:
A8EA2/3A8FFFD8, prev A8EA2/3A8FFF98, desc: LOCK xid 53575669 db 16401 rel
46272403 xid 41
821378 db 16401 rel 145209998 xid 53575669 db 16401 rel 46272400 xid
4260572256 db 16401 rel 144945162 xid 4260572256 db 16401 rel 144943160 xid
41821378 db 1
6401 rel 145212760 xid 27311547 db 16401 rel 145174122 xid 53575669 db
16401 rel 85876526 xid 53575647 db 16401 rel 145251397 xid 53575629 db
16401 rel 335265
72 xid 53575669 db 16401 rel 46272405 xid 53575629 db 16401 rel 145251380
xid 4260572256 db 16401 rel 144943131 xid 51340860 db 16401 rel 145230700
xid 426057
2256 db 16401 rel 144943165 xid 41821378 db 16401 rel 145212765 xid
41821378 db 16401 rel 145209993 xid 54917157 db 16401 rel 145276215 xid
41821378 db 16401
rel 145209944 xid 53575629 db 16401 rel 145251385 xid 54917157 db 16401 rel
145276220 xid 4260572256 db 16401 rel 144945157 xid 41706387 db 16401 rel
14520963
9 xid 53575647 db 16401 rel 89115947 xid 53575647 db 16401 rel 145251392
xid 53056024 db 16401 rel 145235217
rmgr: Standby len (rec/tot): 102/ 102, tx: 0, lsn:
A8EA2/3A900150, prev A8EA2/3A8FFFD8, desc: RUNNING_XACTS nextXid 55152875
latestCompletedXid 55152874 oldestRunningXid 4260572256; 13 xacts: 53056024
53575669 41821378 53575647 41706387 53575629 4260572256 51340860 54917157
44843675 44370470 27311547 55152872
rmgr: XLOG len (rec/tot): 106/ 106, tx: 0, lsn:
A8EA2/3A9001B8, prev A8EA2/3A900150, desc: CHECKPOINT_ONLINE redo
A8EA2/30DFAEA8; tli 7; prev tli 7; fpw true; xid 10:55138257; oid
145279495; multi 259; offset 660; oldest xid 3783095690 in DB 16401; oldest
multi 1 in DB 16401; oldest/newest commit timestamp xid: 0/0; oldest
running xid 4260572256; online
rmgr: Heap len (rec/tot): 130/ 130, tx: 55152872, lsn:
A8EA2/3A900228, prev A8EA2/3A9001B8, desc: INSERT off 47, blkref #0: rel
1663/16401/88935494 blk 15803338

Thanks and regards,
Radoslav

Attachment Content-Type Size
debug.log text/x-log 35.9 KB

Browse pgsql-bugs by date

  From Date Subject
Next Message Boris P. Korzun 2021-04-02 14:09:14 Re: Inconsistent behavior of pg_dump/pg_restore on DEFAULT PRIVILEGES
Previous Message egashira.yusuke@fujitsu.com 2021-04-02 11:09:21 RE: BUG #16922: In cascading replication, a standby server aborted when an upstream standby server promoted