Re: Standby corruption after master is restarted

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc: emre(at)hasegeli(dot)com, tomas(dot)vondra(at)2ndquadrant(dot)com, pgsql-bugs(at)postgresql(dot)org, gurkan(dot)gur(at)innogames(dot)com, david(dot)pusch(at)innogames(dot)com, patrick(dot)schmidt(at)innogames(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Re: Standby corruption after master is restarted
Date: 2018-04-27 01:04:11
Message-ID: 20180427010411.GF3419@paquier.xyz
Lists: pgsql-bugs pgsql-hackers

On Fri, Apr 27, 2018 at 09:49:08AM +0900, Kyotaro HORIGUCHI wrote:
> Thank you for letting me know about that. Is there any way to know how a
> bug report has been concluded? Or should I search -hackers for
> a corresponding thread?

Keeping an eye on the list of patches for bugs in the CF app, and
looking at the list of open items, is what I do now. For this
particular issue my memory just served me well, as it is hard to tell
that both are the same issue by looking at the title. Good thing I
looked at your patch as well.

> At Thu, 26 Apr 2018 21:13:48 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in <20180426121348(dot)GA2365(at)paquier(dot)xyz>
>> On Thu, Apr 26, 2018 at 07:53:04PM +0900, Kyotaro HORIGUCHI wrote:
>>> I think this behavior is a bug. XLogReadRecord accounts for this
>>> case, but palloc_extended() breaks it. So the attached patch adds a
>>> new flag MCXT_ALLOC_NO_PARAMERR to palloc_extended(), and
>>> allocate_recordbuf calls it with the new flag. That alone fixes
>>> the problem. However, the patch also frees the read state buffer
>>> when facing erroneous records, since such records can leave a
>>> too-large buffer allocated.
>>
>> This problem is already discussed here:
>> https://commitfest.postgresql.org/18/1516/
>>
>> And here is the thread:
>> https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F8B57AD(at)G01JPEXMBYT05
>>
>> Tsunakawa-san and I discussed a couple of approaches. Extending
>> palloc_extended so that an incorrect length does not result in an error
>> is also something that crossed my mind, but the length handling is
>> different on the backend and the frontend, so I discarded the idea you
>> are proposing here and instead relied on a check with AllocSizeIsValid,
>> which gives a simpler patch:
>> https://www.postgresql.org/message-id/20180314052753.GA16179%40paquier.xyz
>
> Yeah, I think all the approaches in the thread came to my mind too,
> but I chose a different one. I'm fine with the approach in the thread.

Okay, cool.
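
For anyone reading along, here is a minimal standalone sketch of the
AllocSizeIsValid idea. It is not the patch from the thread linked above.
The names ReaderSketch, allocate_recordbuf_sketch, MAX_ALLOC_SIZE,
ALLOC_SIZE_IS_VALID and WAL_BLCKSZ are made up for illustration, and
malloc() stands in for palloc(). Only the 1 GB - 1 limit and the shape
of the size test follow the backend's memutils.h. Where exactly the
check belongs in xlogreader.c is an assumption here. The point is that a
garbage record length is reported to the caller as a failure instead of
raising an error:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Same limit as the backend's MaxAllocSize: 1 GB - 1. */
#define MAX_ALLOC_SIZE ((uint64_t) 0x3fffffff)
#define ALLOC_SIZE_IS_VALID(size) ((uint64_t) (size) <= MAX_ALLOC_SIZE)

/* Assumed WAL block size (the default for a stock build). */
#define WAL_BLCKSZ 8192

/* Hypothetical stand-in for the reader state's record buffer. */
typedef struct ReaderSketch
{
    char       *buf;
    uint32_t    buf_size;
} ReaderSketch;

/*
 * Grow the record buffer so that it can hold "reclength" bytes.
 * Returns false, leaving the existing buffer untouched, when the
 * length cannot be valid; the caller can then treat the record as
 * invalid and handle it as end of WAL instead of failing hard.
 */
static bool
allocate_recordbuf_sketch(ReaderSketch *state, uint32_t reclength)
{
    uint64_t    newsize = reclength;
    char       *newbuf;

    assert(state != NULL);

    /* Round up to the next multiple of the WAL block size. */
    newsize += WAL_BLCKSZ - (newsize % WAL_BLCKSZ);

    /* A garbage length read from a recycled segment ends up here. */
    if (!ALLOC_SIZE_IS_VALID(newsize))
        return false;

    newbuf = malloc((size_t) newsize);
    if (newbuf == NULL)
        return false;

    free(state->buf);
    state->buf = newbuf;
    state->buf_size = (uint32_t) newsize;
    return true;
}
```

With this in place, a caller that gets false back for a length such as
0xFFFFFFFF keeps its old buffer and can report the record as invalid,
which is the outcome XLogReadRecord already expects.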

>> This got no interest from committers yet unfortunately, but I think that
>> this is a real problem which should be back-patched :(
>
> Several other WAL-related fixes are also waiting to be picked up..

Yeah, simply ignoring corrupted 2PC files at redo is no fun, and neither
is breaking the promise of replication slots. Let's just make sure that
everything is properly tracked and listed, that's the least we can do.
--
Michael
