Re: BUG #8686: Standby could not restart.

From: Tomonari Katsumata <katsumata(dot)tomonari(at)po(dot)ntts(dot)co(dot)jp>
To: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #8686: Standby could not restart.
Date: 2013-12-19 03:01:43
Message-ID: 52B26197.9080208@po.ntts.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi,

This is a patch against REL9_2_STABLE(*)
(*)0c07ef1ad2ba73a6a050f3e1a19267f961a43586

regards,
---------------
Tomonari Katsumata

(2013/12/19 11:57), katsumata(dot)tomonari(at)po(dot)ntts(dot)co(dot)jp wrote:
> The following bug has been logged on the website:
>
> Bug reference: 8686
> Logged by: Tomonari Katsumata
> Email address: katsumata(dot)tomonari(at)po(dot)ntts(dot)co(dot)jp
> PostgreSQL version: 9.2.6
> Operating system: Red Hat Enterprise Linux 6.2 x86_64
> Description:
>
> Hi,
>
>
> I'm testing whether a standby could restart with asynchronous
replication.
>
>
> The testcase is very simple like below.
> (A) start asyncronous replication
> (B) stop standby with "-m f"
> (C) start standby
>
>
> I tried (B) and (C) periodically.
>
>
> In almost cases, it worked fine.
> But standby could not start one time because of PANIC.
> ---- Log -----
> [Standby] 2013-12-09 10:42:30 JST LOG: 00000: database system was
shut down
> in recovery at 2013-12-09 10:42:29 JST
> [Standby] 2013-12-09 10:42:30 JST LOCATION: StartupXLOG, xlog.c:6273
> cp: cannot stat `../arc/00000002.history': No such file or directory
> [Standby] 2013-12-09 10:42:30 JST LOG: 00000: entering standby mode
> [Standby] 2013-12-09 10:42:30 JST LOCATION: StartupXLOG, xlog.c:6359
> cp: cannot stat `../arc/000000010000000100000008': No such file or
> directory
> [Standby] 2013-12-09 10:42:30 JST LOG: 00000: consistent recovery state
> reached at 1/8E7F110
> [Standby] 2013-12-09 10:42:30 JST LOCATION: CheckRecoveryConsistency,
> xlog.c:7366
> [Standby] 2013-12-09 10:42:30 JST LOG: 00000: restored log file
> "000000010000000100000007" from archive
> [Standby] 2013-12-09 10:42:30 JST LOCATION: RestoreArchivedFile,
> xlog.c:3273
> [Standby] 2013-12-09 10:42:30 JST LOG: 00000: redo starts at 1/783B230
> [Standby] 2013-12-09 10:42:30 JST LOCATION: StartupXLOG, xlog.c:6827
> [Standby] 2013-12-09 10:42:30 JST WARNING: 01000: page 1365 of relation
> base/16384/16388 does not exist
> [Standby] 2013-12-09 10:42:30 JST CONTEXT: xlog redo hot_update: rel
> 1663/16384/16388; tid 1365/152; new 1365/153
> [Standby] 2013-12-09 10:42:30 JST LOCATION: report_invalid_page,
> xlogutils.c:66
> [Standby] 2013-12-09 10:42:30 JST PANIC: XX000: WAL contains
references to
> invalid pages
> [Standby] 2013-12-09 10:42:30 JST CONTEXT: xlog redo hot_update: rel
> 1663/16384/16388; tid 1365/152; new 1365/153
> [Standby] 2013-12-09 10:42:30 JST LOCATION: log_invalid_page,
> xlogutils.c:91
> [Standby] 2013-12-09 10:42:30 JST LOG: 00000: startup process (PID
12560)
> was terminated by signal 6: Aborted
> [Standby] 2013-12-09 10:42:30 JST LOCATION: LogChildExit,
> postmaster.c:2895
> [Standby] 2013-12-09 10:42:30 JST LOG: 00000: terminating any other
active
> server processes
> [Standby] 2013-12-09 10:42:30 JST LOCATION: HandleChildCrash,
> postmaster.c:2682
> [Standby] 2013-12-09 11:10:12 JST LOG: 00000: database system was
> interrupted while in recovery at log time 2013-12-09 10:32:14 JST
> [Standby] 2013-12-09 11:10:12 JST HINT: If this has occurred more
than once
> some data might be corrupted and you might need to choose an earlier
> recovery target.
> [Standby] 2013-12-09 11:10:12 JST LOCATION: StartupXLOG, xlog.c:6289
> ---------
>
>
> I tried to fix this problem.
> At first, I doubted the recovery state reached "consistent" before redo
> starts.
> And then I checked pg_control and related WAL.
> The WAL sequence is like below.
>
>
> WAL--(a)--(b)--(c)--(d)--(e)-->
> ================================================
> (a) Latest checkpoint's REDO location
> 1/783B230
>
>
> (b) hot_update
> 1/7842010
>
>
> (c) truncate
> 1/8E7E5C8
>
>
> (d) Latest checkpoint location
> 1/8E7F0B0
>
>
> (e) Minimum recovery ending location
> 1/8E7F110
> ================================================
>
>
> >From these things, I found it has happened with this scenario.
> ----------
> (1) standby starting
> (2) seeking checkpoint location 1/8E7F0B0 because backup_label is not
> absecnt
> (3) reachedConsistency is set to true at 1/8E7F110 in
> CheckRecoveryConsistent
> (4) redo start from 1/783B230
> (5) PANIC at 1/7842010 because reachedConsistency has set already and
> operating against a block which will be truncated at (c).
> ----------
>
>
> At step(2), EndRecPtr is set to 1/8E7F110(next to 1/8E7F0B0),
> so reachedConsistency is set to true at step(3).
>
>
> I think it's not need to increase EndRecPtr while seeking checkpoint
> location.
> I tried to revise it and this worked fine.
>
>
> I think this is very very narrow gate, but it could happen.
>
>
> I'm not sure this problem happen with synchronous replication or
> another version of PostgreSQL(ie. 9.3/9.1/9.0),
> but at least we need to fix it.
>
>
> regards,
> ----------------
> Tomonari Katsumata
>
>
>
>

Attachment Content-Type Size
do_not_increase_EndRecPtr_while_seek_ckptloc.patch text/x-diff 1.6 KB

In response to

Browse pgsql-bugs by date

  From Date Subject
Next Message Alvaro Herrera 2013-12-19 15:58:07 Re: Update with subselect sometimes returns wrong result
Previous Message katsumata.tomonari 2013-12-19 02:57:26 BUG #8686: Standby could not restart.