Re: Failed recovery with new faster 2PC code

From: Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>
To: Stas Kelvich <s(dot)kelvich(at)postgrespro(dot)ru>
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>, Jeff Janes <jeff(dot)janes(at)gmail(dot)com>, Michael Paquier <michael(dot)paquier(at)gmail(dot)com>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Jesper Pedersen <jesper(dot)pedersen(at)redhat(dot)com>
Subject: Re: Failed recovery with new faster 2PC code
Date: 2017-04-18 08:57:10
Message-ID: CAMGcDxeykkrKCk0FY9Pzt5JusLWw4woKXs8NoqjbOZfQQZ-i2Q@mail.gmail.com
Lists: pgsql-hackers

Please find attached a second version of my bug fix which is stylistically
better and clearer than the first one.

Regards,
Nikhils

On 18 April 2017 at 13:47, Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com> wrote:

> Hi,
>
> There was a bug in the redo 2PC remove code path. Because of it,
> autovacuum would think that the 2PC transaction was gone and cause the
> corresponding clog entry to be removed earlier than needed.
>
> Please find attached the bug fix: 2pc_redo_remove_bug.patch.
>
> I have been testing this on top of Michael's 2pc-restore-fix.patch and
> things have been OK for the past hour or so. I will keep it running longer.
>
> Jeff, thanks for these very useful scripts. I am going to make a habit of
> running these scripts on my side from now on. Do you have any other scripts
> that I could try against these patches? Please let me know.
>
> Regards,
> Nikhils
>
> On 18 April 2017 at 12:09, Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>
> wrote:
>
>>
>>
>> On 17 April 2017 at 15:02, Nikhil Sontakke <nikhils(at)2ndquadrant(dot)com>
>> wrote:
>>
>>>
>>>
>>>> >> commit 728bd991c3c4389fb39c45dcb0fe57e4a1dccd71
>>>> >> Author: Simon Riggs <simon(at)2ndQuadrant(dot)com>
>>>> >> Date: Tue Apr 4 15:56:56 2017 -0400
>>>> >>
>>>> >> Speedup 2PC recovery by skipping two phase state files in normal path
>>>> >
>>>> > Thanks Jeff for your tests.
>>>> >
>>>> > So that's now two crash bugs in as many days and lack of clarity about
>>>> > how to fix it.
>>>> >
>>>>
>>>
>> The issue seems to be that a prepared transaction is yet to be
>> committed. But autovacuum comes in and causes the clog to be truncated
>> beyond this prepared transaction ID in one of the runs.
>>
>> We only add the corresponding pgproc entry for a surviving 2PC
>> transaction on completion of recovery. So there could be a race condition
>> here. Digging in further.
>>
>> Regards,
>> Nikhils
>> --
>> Nikhil Sontakke http://www.2ndQuadrant.com/
>> PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services
>>

--
Nikhil Sontakke http://www.2ndQuadrant.com/
PostgreSQL/Postgres-XL Development, 24x7 Support, Training & Services

Attachment Content-Type Size
2pc_redo_remove_bug_v2.0.patch application/octet-stream 786 bytes
