Re: BUG #8686: Standby could not restart.

From: Tomonari Katsumata <katsumata(dot)tomonari(at)po(dot)ntts(dot)co(dot)jp>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>, Tomonari Katsumata <t(dot)katsumata1122(at)gmail(dot)com>
Cc: pgsql-bugs(at)postgresql(dot)org
Subject: Re: BUG #8686: Standby could not restart.
Date: 2014-01-08 08:41:29
Message-ID: 52CD0F39.6000403@po.ntts.co.jp
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

Hi Heikki,

Sorry for slow response and
thank you for new patch!

I've tried your patch, but the patch was not for 92_STABLE.
So I created a patch for 92_STABLE(*) from your patch.
(*)against 8aa6912b8fec3400441c365bde6a1030295de749

It worked fine!
If there are no problem, please commit this patch also to 92_STABLE.

NOTE:
I think your original patch has a miss typing.
We should use 'checkPoint' not 'checkpoint'.
-----------------------------------------------------------
<original>
24:+ xlogctl->replayEndRecPtr = checkpoint.redo;
27:+ xlogctl->lastReplayedEndRecPtr = checkpoint.redo;

<fix>
24:+ xlogctl->replayEndRecPtr = checkPoint.redo;
27:+ xlogctl->lastReplayedEndRecPtr = checkPoint.redo;
-----------------------------------------------------------

regards,
-------------
Tomonari Katsumata

(2014/01/06 23:38), Heikki Linnakangas wrote:
> On 12/23/2013 08:15 AM, Tomonari Katsumata wrote:
>>> /*
>>>> * Initialize shared replayEndRecPtr,
>>>> lastReplayedEndRecPtr, and
>>>> * recoveryLastXTime.
>>>> *
>>>> * This is slightly confusing if we're starting from an
>>>> online
>>>> * checkpoint; we've just read and replayed the
>>>> checkpoint record, but
>>>> * we're going to start replay from its redo pointer,
>>>> which precedes
>>>> * the location of the checkpoint record itself. So even
>>>> though the
>>>> * last record we've replayed is indeed ReadRecPtr, we
>>>> haven't
>>>> * replayed all the preceding records yet. That's OK for
>>>> the current
>>>> * use of these variables.
>>>> */
>>>> SpinLockAcquire(&xlogctl->info_lck);
>>>> xlogctl->replayEndRecPtr = ReadRecPtr;
>>>> xlogctl->lastReplayedEndRecPtr = EndRecPtr;
>>>> xlogctl->recoveryLastXTime = 0;
>>>> xlogctl->currentChunkStartTime = 0;
>>>> xlogctl->recoveryPause = false;
>>>> SpinLockRelease(&xlogctl->info_lck);
>>>
>>> I think we need to fix that confusion. Your patch will do it by not
>>> setting EndRecPtr yet; that fixes the bug, but leaves those
variables in a
>>> slightly strange state; I'm not sure what EndRecPtr points to in
that case
>>> (0 ?), but ReadRecPtr would be set I guess.
>> Yes, the values were set like below.
>> ReadRecPtr:1/8E7F0B0
>> EndRecPtr:0/0
>>
>>
>>>
>>> Perhaps we should reset replayEndRecPtr and lastReplayedEndRecPtr
to the
>>> REDO point here, instead of ReadRecPtr/EndRecPtr.
>>
>> I made another patch.
>> I added a ReadRecord to make sure the REDO location is present or not.
>> The similar process are done when we use backup_label.
>>
>> Because the ReadRecord returns a record already read,
>> I set ReadRecPtr of the record to EndRecPtr.
>> And also I set record->xl_prev to ReadRecPtr.
>> As you said, it also worked fine.
>>
>> I'm not sure we should do same thing when crash recovery occurs, but
now I
>> added the process when archive recovery is needed.
>>
>> Please see attached patch.
>
> Hmm. That would still initialize lastReplayedEndRecPtr to the
checkpoint record, when you do crash recovery (or begin archive recovery
from a filesystem snapshot, where you perform crash recovery before
starting to read the archive). I'm not very comfortable with that, even
though I don't see an immediate problem with it.
>
> I also noticed that CheckRecoveryConsistency() compares
backupEndPoint with EndRecPtr, but minRecoveryPoint with
lastReplayedEndRecPtr. That's correct as the code stands, but it's an
accident waiting to happen: if you called CheckRecoveryConsistency()
after reading a record with ReadRecord(), but before fully replaying it,
it might conclude that it's reached the backup end location one record
too early. And it's inconsistent, anyway.
>
> I propose the attached patch. I haven't tested it, but I think it's a
slightly more robust fix.
>
> - Heikki
>
>

Attachment Content-Type Size
fix-lastReplayedEndRecPtr-1_for92stable.patch text/x-patch 3.2 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Alvaro Herrera 2014-01-08 14:00:58 Re: Backends stuck in LISTEN
Previous Message Michael Paquier 2014-01-08 01:43:08 Re: BUG #8746: While Calling Trigger on same table