| From: | Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com> | 
|---|---|
| To: | pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> | 
| Subject: | PANIC during crash recovery of a recently promoted standby | 
| Date: | 2018-05-10 05:22:12 | 
| Message-ID: | CABOikdPOewjNL=05K5CbNMxnNtXnQjhTx2F--4p4ruorCjukbA@mail.gmail.com | 
| Lists: | pgsql-hackers | 
Hello,
I recently investigated a problem in which a standby is promoted to be the new
master and then crashes shortly afterwards, for whatever reason. During the
subsequent crash recovery, the promoted standby (now master) PANICs with a
message such as:
PANIC,XX000,"WAL contains references to invalid pages",,,,,,,,"XLogCheckInvalidPages, xlogutils.c:242",""
After investigation, I was able to construct a reproduction scenario for this
problem. The attached TAP test (thanks to Alvaro for converting my bash
script to a TAP test) demonstrates it. The test is probably
sensitive to timing, but it reproduces the problem consistently, at least at
my end. While the original report was for 9.6, I can reproduce it on the
master branch, so it probably affects all supported releases.
Investigations point to a possible bug where we fail to update the
minRecoveryPoint after completing the ongoing restart point upon promotion.
IMV after promotion the new master must always recover to the end of the
WAL to ensure that all changes are applied correctly. What we have
instead is that minRecoveryPoint remains set to a prior location because of
this:
    /*
     * Update pg_control, using current time.  Check that it still shows
     * IN_ARCHIVE_RECOVERY state and an older checkpoint, else do nothing;
     * this is a quick hack to make sure nothing really bad happens if somehow
     * we get here after the end-of-recovery checkpoint.
     */
    LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
    if (ControlFile->state == DB_IN_ARCHIVE_RECOVERY &&
        ControlFile->checkPointCopy.redo < lastCheckPoint.redo)
    {
        ControlFile->checkPoint = lastCheckPointRecPtr;
        ControlFile->checkPointCopy = lastCheckPoint;
        ControlFile->time = (pg_time_t) time(NULL);
        /*
         * Ensure minRecoveryPoint is past the checkpoint record.  Normally,
         * this will have happened already while writing out dirty buffers,
         * but not necessarily - e.g. because no buffers were dirtied.  We do
         * this because a non-exclusive base backup uses minRecoveryPoint to
         * determine which WAL files must be included in the backup, and the
         * file (or files) containing the checkpoint record must be included,
         * at a minimum. Note that for an ordinary restart of recovery there's
         * no value in having the minimum recovery point any earlier than this
         * anyway, because redo will begin just after the checkpoint record.
         */
        if (ControlFile->minRecoveryPoint < lastCheckPointEndPtr)
        {
            ControlFile->minRecoveryPoint = lastCheckPointEndPtr;
            ControlFile->minRecoveryPointTLI = lastCheckPoint.ThisTimeLineID;
            /* update local copy */
            minRecoveryPoint = ControlFile->minRecoveryPoint;
            minRecoveryPointTLI = ControlFile->minRecoveryPointTLI;
        }
        if (flags & CHECKPOINT_IS_SHUTDOWN)
            ControlFile->state = DB_SHUTDOWNED_IN_RECOVERY;
        UpdateControlFile();
    }
    LWLockRelease(ControlFileLock);
After promotion, the minRecoveryPoint is only updated (cleared) when the
first regular checkpoint completes. If a crash happens before that, we will
run the crash recovery with a stale minRecoveryPoint, which results in
the PANIC that we diagnosed. The test case was written to reproduce the
issue as reported to us. Thus the test case TRUNCATEs and extends the table
at hand after promotion. The crash shortly thereafter leaves the pages in
an uninitialised state because the shared buffers are not yet flushed to
disk.
During crash recovery, we see uninitialised pages for the WAL records
written before the promotion. These pages are remembered and we expect to
either see a DROP TABLE or TRUNCATE WAL record before the minRecoveryPoint
is reached. But since the minRecoveryPoint is still pointing to a WAL
location prior to the TRUNCATE operation, crash recovery hits the
minRecoveryPoint before seeing the TRUNCATE WAL record. That results in a
PANIC situation.
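To make the sequence above concrete, here is a toy model in C. It is only a sketch under simplifying assumptions: a single relation, a WAL of two record kinds, and an LSN of 0 meaning that no minimum recovery point is set. Every name in it (ToyLsn, ToyRecord, toy_replay, toy_scenario) is hypothetical and is not taken from the PostgreSQL source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: every name below is hypothetical, not PostgreSQL source. */
typedef uint64_t ToyLsn;            /* 0 stands for "no minimum recovery point" */

typedef enum { TOY_PAGE_UPDATE, TOY_TRUNCATE } ToyRecKind;

typedef struct
{
    ToyLsn      end_lsn;            /* WAL position just past this record */
    ToyRecKind  kind;
} ToyRecord;

typedef enum { TOY_REPLAY_OK, TOY_REPLAY_PANIC } ToyResult;

/*
 * Replay records in order.  An update against a missing page is remembered
 * as an invalid-page reference; a later TRUNCATE forgets it.  The first time
 * replay reaches a non-zero min_recovery_point, a reference that is still
 * remembered is fatal.  The same check is repeated at the end of the WAL.
 */
ToyResult
toy_replay(const ToyRecord *recs, size_t nrecs,
           bool page_missing, ToyLsn min_recovery_point)
{
    bool invalid_ref = false;       /* unresolved reference to a missing page */
    bool consistent = false;        /* set once the minimum point is reached */

    for (size_t i = 0; i < nrecs; i++)
    {
        /* records must be in increasing WAL order */
        assert(i == 0 || recs[i - 1].end_lsn < recs[i].end_lsn);

        if (recs[i].kind == TOY_PAGE_UPDATE && page_missing)
            invalid_ref = true;     /* remember it, hoping for a later TRUNCATE */
        else if (recs[i].kind == TOY_TRUNCATE)
            invalid_ref = false;    /* relation was reset; forget the reference */

        if (!consistent && min_recovery_point != 0 &&
            recs[i].end_lsn >= min_recovery_point)
        {
            consistent = true;
            if (invalid_ref)
                return TOY_REPLAY_PANIC;
        }
    }
    return invalid_ref ? TOY_REPLAY_PANIC : TOY_REPLAY_OK;
}

/* The scenario from the report: one pre-promotion update, then a TRUNCATE. */
ToyResult
toy_scenario(ToyLsn min_recovery_point)
{
    const ToyRecord wal[] = {
        { 100, TOY_PAGE_UPDATE },   /* written before promotion */
        { 200, TOY_TRUNCATE },      /* written after promotion */
    };

    return toy_replay(wal, sizeof(wal) / sizeof(wal[0]), true,
                      min_recovery_point);
}
```

In this model, toy_scenario(100) represents the stale minRecoveryPoint and ends in TOY_REPLAY_PANIC, because the check fires before the TRUNCATE record is replayed. toy_scenario(200) and toy_scenario(0) both replay the TRUNCATE first and return TOY_REPLAY_OK.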
I propose that we should always clear the minRecoveryPoint after promotion
to ensure that crash recovery always runs to the end if a just-promoted
standby crashes before completing its first regular checkpoint. A WIP patch
is attached.
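For illustration only, the control-file side of this can be sketched as a small C model. This is not the attached patch. It assumes a simplified control file with just a state and a minimum recovery point, and it models only the state test from the guard quoted earlier, not the checkpoint comparison. All names (ToyControlFile, toy_finish_restartpoint, toy_promote, toy_after_promotion) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: names are hypothetical and this is not the attached patch. */
typedef uint64_t ToyLsn;            /* 0 stands for "cleared" */

typedef enum { TOY_IN_ARCHIVE_RECOVERY, TOY_IN_PRODUCTION } ToyDbState;

typedef struct
{
    ToyDbState  state;
    ToyLsn      min_recovery_point;
} ToyControlFile;

/*
 * End of a restartpoint: the control file is touched only while it still
 * shows archive recovery, mirroring the state test in the quoted guard.
 */
void
toy_finish_restartpoint(ToyControlFile *cf, ToyLsn checkpoint_end)
{
    assert(cf != NULL);
    if (cf->state != TOY_IN_ARCHIVE_RECOVERY)
        return;                     /* already promoted: left untouched */
    if (cf->min_recovery_point < checkpoint_end)
        cf->min_recovery_point = checkpoint_end;
}

/* Promotion, optionally clearing the minimum recovery point as proposed. */
void
toy_promote(ToyControlFile *cf, bool clear_min_recovery_point)
{
    assert(cf != NULL);
    cf->state = TOY_IN_PRODUCTION;
    if (clear_min_recovery_point)
        cf->min_recovery_point = 0;
}

/* Promote while a restartpoint is in flight, then let it finish. */
ToyLsn
toy_after_promotion(bool clear_on_promote)
{
    ToyControlFile cf = { TOY_IN_ARCHIVE_RECOVERY, 100 };

    toy_promote(&cf, clear_on_promote);
    toy_finish_restartpoint(&cf, 300);  /* skipped: no longer in recovery */
    return cf.min_recovery_point;
}
```

Without clearing, toy_after_promotion(false) leaves the minimum recovery point at its old value of 100, which is the stale value a later crash recovery would pick up. With clearing, toy_after_promotion(true) returns 0, so nothing short of the end of the WAL is treated as a consistent stopping point.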
Thanks,
Pavan
-- 
 Pavan Deolasee                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services
| Attachment | Content-Type | Size | 
|---|---|---|
| 0002-Ensure-recovery-is-run-to-the-end-upon-promotion-of-.patch | application/octet-stream | 1.8 KB | 
| 0001-A-new-TAP-test-to-test-a-recovery-bug.patch | application/octet-stream | 3.2 KB | 