From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Alvaro Herrera <alvherre(at)alvh(dot)no-ip(dot)org> |
Cc: | Pavan Deolasee <pavan(dot)deolasee(at)gmail(dot)com>, "Wood, Dan" <hexpert(at)amazon(dot)com>, "pgsql-hackers(at)postgresql(dot)org" <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: VM map freeze corruption |
Date: | 2018-04-19 02:35:02 |
Message-ID: | CAD21AoCqXqQdPTTRt_bUikpZFS89+6KO8AyuvhRdaq55x7LaKQ@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Wed, Apr 18, 2018 at 10:36 PM, Alvaro Herrera
<alvherre(at)alvh(dot)no-ip(dot)org> wrote:
> Pavan Deolasee wrote:
>> On Wed, Apr 18, 2018 at 7:37 AM, Wood, Dan <hexpert(at)amazon(dot)com> wrote:
>
>> > My analysis is that heap_prepare_freeze_tuple->FreezeMultiXactId()
>> > returns FRM_NOOP if the MultiXACT locked rows haven't committed. This
>> > results in changed=false and totally_frozen=true(as initialized). When
>> > this returns to lazy_scan_heap(), no rows are added to the frozen[] array.
>> > Yet, tuple_totally_frozen is true. This means the page is marked frozen in
>> > the VM, even though the MultiXACT row wasn't left untouch.
>> >
>> > A fix to heap_prepare_freeze_tuple() that seems to do the trick is:
>> > else
>> > {
>> > Assert(flags & FRM_NOOP);
>> > + totally_frozen = false;
>> > }
>> >
>>
>> That's a great find!
>
> Indeed.
>
> This family of bugs (introduced by freeze map changes in 9.6) was
> initially fixed in 38e9f90a227d but this spot was missed in that fix.
>
> IMO the cause is the totally_frozen variable, which starts life in a
> bogus state (true) and the different code paths can set it to the right
> state, or by inaction end up deciding that the initial bogus state was
> correct in the first place. Errors of omission are far too easy in that
> kind of model, ISTM, so I propose this slightly different patch, which
> albeit yet untested seems easier to verify and easier to get right.
>
Thank you for the patch!
Agreed. The patch looks to fix this issue correctly.
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
From | Date | Subject | |
---|---|---|---|
Next Message | Peter Geoghegan | 2018-04-19 02:58:15 | Re: Adding an LWLockHeldByMe()-like function that reports if any buffer content lock is held |
Previous Message | Michael Paquier | 2018-04-19 02:28:02 | Re: Speedup of relation deletes during recovery |