Quick Links

Re: 12.3 replicas falling over during WAL redo

From:	Ben Chobot <bench(at)silentmedia(dot)com>
To:	Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc:	pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject:	Re: 12.3 replicas falling over during WAL redo
Date:	2020-08-03 19:11:05
Message-ID:	b64cb215-e03a-9e07-e916-5b216b0c5b1b@silentmedia.com
Views:	Raw Message \| Whole Thread \| Download mbox \| Resend email
Thread:
Lists:	pgsql-general

Ben Chobot wrote on 8/1/20 9:58 AM:
> Alvaro Herrera wrote on 8/1/20 9:35 AM:
>> On 2020-Aug-01, Ben Chobot wrote:
>>
>> Can you find out what the index is being modified by those LSNs -- is it
>> always the same index? Can you have a look at nearby WAL records that
>> touch the same page of the same index in each case?
>>
>> One possibility is that the storage forgot a previous write.

So here is me fumbling how to do this. Hopefully this is useful?

The error for this host was:

2020-07-30T14:59:36.839243+00:00 host postgres[24338]: [45253-1]
db=,user= WARNING: specified item offset is too large
2020-07-30T14:59:36.839307+00:00 host postgres[24338]: [45253-2]
db=,user= CONTEXT: WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:36.839337+00:00 host postgres[24338]: [45254-1]
db=,user= PANIC: btree_xlog_insert: failed to add item
2020-07-30T14:59:36.839366+00:00 host postgres[24338]: [45254-2]
db=,user= CONTEXT: WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:37.587173+00:00 host postgres[24337]: [11-1] db=,user=
LOG: startup process (PID 24338) was terminated by signal 6: Aborted
2020-07-30T14:59:37.587266+00:00 host postgres[24337]: [12-1] db=,user=
LOG: terminating any other active server processes

So, starting from a filesystem snapshot where the db had crashed and not
yet been started, I did:

/usr/lib/postgresql/12/bin/pg_waldump -p
/var/lib/postgresql/12/main/pg_wal/ 0000000100000A0A000000AB
0000000100000A0A000000AD | grep A0A/AC4204A0

This shows me:

rmgr: Btree len (rec/tot): 72/ 72, tx: 76396065, lsn:
A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0:
rel 16605/16613/60529051 blk 6501

So then I did:

/usr/lib/postgresql/12/bin/pg_waldump -p
/var/lib/postgresql/12/main/pg_wal/ 0000000100000A0A000000AB
0000000100000A0A000000AD | grep 16605/16613/60529051
rmgr: Btree       len (rec/tot):     53/   813, tx:   76393363, lsn:
A0A/AB289778, prev A0A/AB287F30, desc: INSERT_LEAF off 26, blkref #0:
rel 16605/16613/60529051 blk 6089 FPW
rmgr: Btree       len (rec/tot):     72/    72, tx:   76393390, lsn:
A0A/AB2BE870, prev A0A/AB2BD6E0, desc: INSERT_LEAF off 128, blkref #0:
rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     72/    72, tx:   76393394, lsn:
A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0:
rel 16605/16613/60529051 blk 6501
rmgr: Btree       len (rec/tot):     72/    72, tx:   76393992, lsn:
A0A/AB5FA848, prev A0A/AB5FA7F0, desc: INSERT_LEAF off 133, blkref #0:
rel 16605/16613/60529051 blk 5999
rmgr: Btree       len (rec/tot):     72/    72, tx:   76394938, lsn:
A0A/ABE49738, prev A0A/ABE47AF8, desc: INSERT_LEAF off 189, blkref #0:
rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     72/    72, tx:   76394977, lsn:
A0A/ABEB0330, prev A0A/ABEB02D8, desc: INSERT_LEAF off 204, blkref #0:
rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn:
A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0:
rel 16605/16613/60529051 blk 6501
rmgr: Btree       len (rec/tot):     72/    72, tx:   76396075, lsn:
A0A/AC427518, prev A0A/AC4274C0, desc: INSERT_LEAF off 211, blkref #0:
rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     53/ 2529, tx:   76396250, lsn:
A0A/AC5352F8, prev A0A/AC533A00, desc: INSERT_LEAF off 87, blkref #0:
rel 16605/16613/60529051 blk 5639 FPW
pg_waldump: fatal: error in WAL record at A0A/AC5411B0: invalid resource
manager ID 110 at A0A/AC5411E0

...and I have no idea what I'm looking at. I assume/hope the error at
the end is due to the db shutting down, and nothing to be particularly
worried about?

I have the specific index that postgres tripped on here and I'll spend
some time today trying to see if the other crashes were the same index.

In response to

Re: 12.3 replicas falling over during WAL redo at 2020-08-01 16:58:05 from Ben Chobot

Responses

Re: 12.3 replicas falling over during WAL redo at 2020-08-03 19:34:04 from Alvaro Herrera
Re: 12.3 replicas falling over during WAL redo at 2020-08-03 20:49:55 from Ben Chobot

Browse pgsql-general by date

	From	Date	Subject
Next Message	Alvaro Herrera	2020-08-03 19:34:04	Re: 12.3 replicas falling over during WAL redo
Previous Message	Peter Geoghegan	2020-08-03 18:25:08	Re: 12.3 replicas falling over during WAL redo