Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers

From: Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com>
To: Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com>
Cc: Craig Ringer <craig(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers
Date: 2017-12-26 13:53:44
Message-ID: CAD21AoCfYEeO75NWxoRHi25+_j6LqgQMs2spfmmkacewijBo2w@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Tue, Dec 26, 2017 at 10:03 PM, Petr Jelinek
<petr(dot)jelinek(at)2ndquadrant(dot)com> wrote:
> On 26/12/17 11:13, Masahiko Sawada wrote:
>> On Tue, Dec 26, 2017 at 12:49 AM, Petr Jelinek
>> <petr(dot)jelinek(at)2ndquadrant(dot)com> wrote:
>>
>>>>
>>>> It's not a problem on crash restart because StartupReorderBuffer already
>>>> does the required delete.
>>>>
>>>> ReorderBufferSerializeTXN, which spills the txns to disk, doesn't appear
>>>> to have any guard to ensure that the segment files don't already exist
>>>> when it goes to serialize a snapshot. Adding one there would probably be
>>>> expensive; we don't know the last lsn of the txn yet, so to be really
>>>> safe we'd have to do a directory listing and scan for any txn-$OURXID-*
>>>> entries.
>>>>
>>>> So to fix, I suggest that we should do a
>>>> slot-specific StartupReorderBuffer-style deletion when we start a new
>>>> decoding session on a slot, per attached patch.
>>>>
>>>> It might be nice to also add a hook on proc exit, so we don't have stale
>>>> buffers lying around until next decoding session, but I didn't want to
>>>> add new global state to reorderbuffer.c so I've left that for now.
>>>
>>>
>>> Hmm, can't we simply call the new cleanup function in
>>> ReplicationSlotRelease()? That's called during process exit and we know
>>> there about slot so no extra global variables are needed.
>>>
>>
>> I guess that ReplicationSlotRelease() currently might not get called
>> if walsender exits by proc_exit(). ReplicationSlotRelease() can is
>> called by some functions such as WalSndErrorCleanup(), but at least in
>> the case where wal sender exits due to failed to write data to socket,
>> ReplicationSlotRelease() didn't get called as far as I tested.
>>
>
> Are you sure about that testing? Did you test it with replication slot
> active? ReplicationSlotRelease() is called from ProcKill() if the
> process is using a slot and should be called for any kind of exit except
> for outright crash (the kind that makes postgres kill all backends). If
> it wasn't we'd never unlock the replication slot used by the exiting
> walsender.
>

Ah, sorry I was wrong. WalSndErrorCleanup() didn't get called but
ReplicationSlotRelease() got called. I agree that cleanup function
gets called in ReplicationSlotRelease().

Regards,

--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Pavel Stehule 2017-12-26 13:56:47 Re: Ethiopian calendar year(DATE TYPE) are different from the Gregorian calendar year
Previous Message Lelisa Diriba 2017-12-26 13:46:33 Ethiopian calendar year(DATE TYPE) are different from the Gregorian calendar year