Re: 12.3 replicas falling over during WAL redo

From: Ben Chobot <bench(at)silentmedia(dot)com>
To: Alvaro Herrera <alvherre(at)2ndquadrant(dot)com>
Cc: pgsql-general <pgsql-general(at)postgresql(dot)org>
Subject: Re: 12.3 replicas falling over during WAL redo
Date: 2020-08-03 19:11:05
Message-ID: b64cb215-e03a-9e07-e916-5b216b0c5b1b@silentmedia.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Ben Chobot wrote on 8/1/20 9:58 AM:
> Alvaro Herrera wrote on 8/1/20 9:35 AM:
>> On 2020-Aug-01, Ben Chobot wrote:
>>
>> Can you find out what the index is being modified by those LSNs -- is it
>> always the same index?  Can you have a look at nearby WAL records that
>> touch the same page of the same index in each case?
>>
>> One possibility is that the storage forgot a previous write.

So here is me fumbling how to do this. Hopefully this is useful?

The error for this host was:

2020-07-30T14:59:36.839243+00:00 host postgres[24338]: [45253-1]
db=,user= WARNING:  specified item offset is too large
2020-07-30T14:59:36.839307+00:00 host postgres[24338]: [45253-2]
db=,user= CONTEXT:  WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:36.839337+00:00 host postgres[24338]: [45254-1]
db=,user= PANIC:  btree_xlog_insert: failed to add item
2020-07-30T14:59:36.839366+00:00 host postgres[24338]: [45254-2]
db=,user= CONTEXT:  WAL redo at A0A/AC4204A0 for Btree/INSERT_LEAF: off 48
2020-07-30T14:59:37.587173+00:00 host postgres[24337]: [11-1] db=,user=
LOG:  startup process (PID 24338) was terminated by signal 6: Aborted
2020-07-30T14:59:37.587266+00:00 host postgres[24337]: [12-1] db=,user=
LOG:  terminating any other active server processes

So, starting from a filesystem snapshot where the db had crashed and not
yet been started, I did:

/usr/lib/postgresql/12/bin/pg_waldump -p
/var/lib/postgresql/12/main/pg_wal/ 0000000100000A0A000000AB
0000000100000A0A000000AD | grep A0A/AC4204A0

This shows me:

rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn:
A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0:
rel 16605/16613/60529051 blk 6501

So then I did:

/usr/lib/postgresql/12/bin/pg_waldump -p
/var/lib/postgresql/12/main/pg_wal/ 0000000100000A0A000000AB
0000000100000A0A000000AD | grep 16605/16613/60529051
rmgr: Btree       len (rec/tot):     53/   813, tx:   76393363, lsn:
A0A/AB289778, prev A0A/AB287F30, desc: INSERT_LEAF off 26, blkref #0:
rel 16605/16613/60529051 blk 6089 FPW
rmgr: Btree       len (rec/tot):     72/    72, tx:   76393390, lsn:
A0A/AB2BE870, prev A0A/AB2BD6E0, desc: INSERT_LEAF off 128, blkref #0:
rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     72/    72, tx:   76393394, lsn:
A0A/AB2C43D0, prev A0A/AB2C4378, desc: INSERT_LEAF off 41, blkref #0:
rel 16605/16613/60529051 blk 6501
rmgr: Btree       len (rec/tot):     72/    72, tx:   76393992, lsn:
A0A/AB5FA848, prev A0A/AB5FA7F0, desc: INSERT_LEAF off 133, blkref #0:
rel 16605/16613/60529051 blk 5999
rmgr: Btree       len (rec/tot):     72/    72, tx:   76394938, lsn:
A0A/ABE49738, prev A0A/ABE47AF8, desc: INSERT_LEAF off 189, blkref #0:
rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     72/    72, tx:   76394977, lsn:
A0A/ABEB0330, prev A0A/ABEB02D8, desc: INSERT_LEAF off 204, blkref #0:
rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     72/    72, tx:   76396065, lsn:
A0A/AC4204A0, prev A0A/AC420450, desc: INSERT_LEAF off 48, blkref #0:
rel 16605/16613/60529051 blk 6501
rmgr: Btree       len (rec/tot):     72/    72, tx:   76396075, lsn:
A0A/AC427518, prev A0A/AC4274C0, desc: INSERT_LEAF off 211, blkref #0:
rel 16605/16613/60529051 blk 6519
rmgr: Btree       len (rec/tot):     53/  2529, tx:   76396250, lsn:
A0A/AC5352F8, prev A0A/AC533A00, desc: INSERT_LEAF off 87, blkref #0:
rel 16605/16613/60529051 blk 5639 FPW
pg_waldump: fatal: error in WAL record at A0A/AC5411B0: invalid resource
manager ID 110 at A0A/AC5411E0

...and I have no idea what I'm looking at. I assume/hope the error at
the end is due to the db shutting down, and nothing to be particularly
worried about?

I have the specific index that postgres tripped on here and I'll spend
some time today trying to see if the other crashes were the same index.

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Alvaro Herrera 2020-08-03 19:34:04 Re: 12.3 replicas falling over during WAL redo
Previous Message Peter Geoghegan 2020-08-03 18:25:08 Re: 12.3 replicas falling over during WAL redo