Re: BUG #13822: Slave terminated - WAL contains references to invalid page

From: <Marek(dot)Petr(at)tieto(dot)com>
To: <michael(dot)paquier(at)gmail(dot)com>
Cc: <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #13822: Slave terminated - WAL contains references to invalid page
Date: 2015-12-30 07:43:52
Message-ID: 80e2163be21c4f7da5651ff0106d1f81@C105S135VM024.eu.tieto.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Same issue simulated via PITR from base backup up to 5 minutes after the crash:

2015-12-30 08:31:19 CET @ LOG: database system was interrupted; last known up at 2015-12-21 20:15:02 CET
2015-12-30 08:31:19 CET @ LOG: starting point-in-time recovery to 2015-12-22 00:30:00+01
2015-12-30 08:31:19 CET @ LOG: database system was not properly shut down; automatic recovery in progress
2015-12-30 08:31:19 CET @ LOG: redo starts at 186/D7260540
2015-12-30 08:31:19 CET @ LOG: consistent recovery state reached at 186/D8000000
2015-12-30 08:31:19 CET @ LOG: database system is ready to accept read only connections
2015-12-30 08:31:19 CET @ LOG: restored log file "0000000100000186000000D8" from archive
2015-12-30 08:31:20 CET @ LOG: restored log file "0000000100000186000000D9" from archive
/.../
2015-12-30 08:32:48 CET @ LOG: restored log file "00000001000001870000009D" from archive
2015-12-30 08:32:50 CET @ LOG: restored log file "00000001000001870000009E" from archive
2015-12-30 08:32:51 CET @ WARNING: page 73172 of relation base/16422/23253 is uninitialized
2015-12-30 08:32:51 CET @ CONTEXT: xlog redo visible: rel 1663/16422/23253; blk 73172
2015-12-30 08:32:51 CET @ PANIC: WAL contains references to invalid pages
2015-12-30 08:32:51 CET @ CONTEXT: xlog redo visible: rel 1663/16422/23253; blk 73172
2015-12-30 08:32:51 CET @ LOG: startup process (PID 29324) was terminated by signal 6: Aborted
2015-12-30 08:32:51 CET @ LOG: terminating any other active server processes

Regards
Marek

-----Original Message-----
From: Petr Marek
Sent: Tuesday, December 29, 2015 4:52 PM
To: 'Michael Paquier' <michael(dot)paquier(at)gmail(dot)com>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: RE: [BUGS] BUG #13822: Slave terminated - WAL contains references to invalid page

I used fresh base backup for slave after both crashes.
Also tried to scan archived wals several hours before last crash and found only the following for string 71566:

rmgr: Heap2 len (rec/tot): 20/ 52, tx: 0, lsn: 187/987859D8, prev 187/987859A0, bkp: 0000, desc: visible: rel 1663/16422/17216; blk 71566
rmgr: Heap2 len (rec/tot): 20/ 52, tx: 0, lsn: 187/9CC59020, prev 187/9CC58FE8, bkp: 0000, desc: visible: rel 1663/16422/17220; blk 71566
rmgr: Heap2 len (rec/tot): 20/ 52, tx: 0, lsn: 187/9E356D98, prev 187/9E356D60, bkp: 0000, desc: visible: rel 1663/16422/23253; blk 71566

Regards
Marek

-----Original Message-----
From: Michael Paquier [mailto:michael(dot)paquier(at)gmail(dot)com]
Sent: Monday, December 28, 2015 4:29 PM
To: Petr Marek <Marek(dot)Petr(at)tieto(dot)com>
Cc: PostgreSQL mailing lists <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: [BUGS] BUG #13822: Slave terminated - WAL contains references to invalid page

On Mon, Dec 28, 2015 at 8:50 PM, <Marek(dot)Petr(at)tieto(dot)com> wrote:
> Tried to use pageinspect module for affected pages from last two occurences:
>
> 2015-12-15 13:05:39 CET @ WARNING: page 4333275 of relation
> base/16422/17230 is uninitialized
> 2015-12-22 00:25:11 CET @ WARNING: page 71566 of relation
> base/16422/23253 is uninitialized
>
> Following outputs are the same for master and slave:
>
> [results]

Hm, OK.

> Non-default pars:
>
> [params]

There is nothing fishy here.

> Slave rebuilded and it's running almost a week for now.

Hm. Has this slave replayed the same WAL records as the slave that has failed previously? Or did it use a fresher base backup? If that's the latter the problem would have been fixed by itself for those two relation pages as they would have been correctly created by the And and not the WAL replay. Perhaps that's too late, but could it be possible to scan the WAL segments you have and see if there is record referring to those pages being initialized or not? You would need to find a record like that:
[insert|update|multi-insert|hot_update](init) rel %u/%u/%u; tid %u/%u tid is t_ctid referred in those upper results you just sent. And this record should normally be present before the ones that caused the PANIC setting the visibility map bit. If that's not the case, it may be possible that there is actually a bug if the page is not found as being initialized properly first. At least we are sure that the corruption is not coming from the master.
--
Michael

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message alirazaidi 2015-12-30 11:48:04 BUG #13839: command line restart server fail
Previous Message Henrik Pauli 2015-12-29 23:59:17 Re: BUG #13829: Exponentiation operator is left-associative