From: | Masahiko Sawada <sawada(dot)mshk(at)gmail(dot)com> |
---|---|
To: | Petr Jelinek <petr(dot)jelinek(at)2ndquadrant(dot)com> |
Cc: | Craig Ringer <craig(at)2ndquadrant(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de> |
Subject: | Re: BUGFIX: standby disconnect can corrupt serialized reorder buffers |
Date: | 2017-12-26 13:53:44 |
Message-ID: | CAD21AoCfYEeO75NWxoRHi25+_j6LqgQMs2spfmmkacewijBo2w@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, Dec 26, 2017 at 10:03 PM, Petr Jelinek
<petr(dot)jelinek(at)2ndquadrant(dot)com> wrote:
> On 26/12/17 11:13, Masahiko Sawada wrote:
>> On Tue, Dec 26, 2017 at 12:49 AM, Petr Jelinek
>> <petr(dot)jelinek(at)2ndquadrant(dot)com> wrote:
>>
>>>>
>>>> It's not a problem on crash restart because StartupReorderBuffer already
>>>> does the required delete.
>>>>
>>>> ReorderBufferSerializeTXN, which spills the txns to disk, doesn't appear
>>>> to have any guard to ensure that the segment files don't already exist
>>>> when it goes to serialize a snapshot. Adding one there would probably be
>>>> expensive; we don't know the last lsn of the txn yet, so to be really
>>>> safe we'd have to do a directory listing and scan for any txn-$OURXID-*
>>>> entries.
>>>>
>>>> So to fix, I suggest that we should do a
>>>> slot-specific StartupReorderBuffer-style deletion when we start a new
>>>> decoding session on a slot, per attached patch.
>>>>
>>>> It might be nice to also add a hook on proc exit, so we don't have stale
>>>> buffers lying around until next decoding session, but I didn't want to
>>>> add new global state to reorderbuffer.c so I've left that for now.
>>>
>>>
>>> Hmm, can't we simply call the new cleanup function in
>>> ReplicationSlotRelease()? That's called during process exit and we know
>>> there about slot so no extra global variables are needed.
>>>
>>
>> I guess that ReplicationSlotRelease() currently might not get called
>> if walsender exits by proc_exit(). ReplicationSlotRelease() can is
>> called by some functions such as WalSndErrorCleanup(), but at least in
>> the case where wal sender exits due to failed to write data to socket,
>> ReplicationSlotRelease() didn't get called as far as I tested.
>>
>
> Are you sure about that testing? Did you test it with replication slot
> active? ReplicationSlotRelease() is called from ProcKill() if the
> process is using a slot and should be called for any kind of exit except
> for outright crash (the kind that makes postgres kill all backends). If
> it wasn't we'd never unlock the replication slot used by the exiting
> walsender.
>
Ah, sorry I was wrong. WalSndErrorCleanup() didn't get called but
ReplicationSlotRelease() got called. I agree that cleanup function
gets called in ReplicationSlotRelease().
Regards,
--
Masahiko Sawada
NIPPON TELEGRAPH AND TELEPHONE CORPORATION
NTT Open Source Software Center
From | Date | Subject | |
---|---|---|---|
Next Message | Pavel Stehule | 2017-12-26 13:56:47 | Re: Ethiopian calendar year(DATE TYPE) are different from the Gregorian calendar year |
Previous Message | Lelisa Diriba | 2017-12-26 13:46:33 | Ethiopian calendar year(DATE TYPE) are different from the Gregorian calendar year |