Re: BUG #9118: WAL Sender does not disconnect replication clients during shutdown

From: Fujii Masao <masao(dot)fujii(at)gmail(dot)com>
To: Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
Cc: jhedden(at)apple(dot)com, pgsql-bugs <pgsql-bugs(at)postgresql(dot)org>
Subject: Re: BUG #9118: WAL Sender does not disconnect replication clients during shutdown
Date: 2014-03-13 10:52:28
Message-ID: CAHGQGwGvfRW+hYLLOSM2CP-mg7qQFmu+GCdfiu9_1AKWdpxMdw@mail.gmail.com
Lists: pgsql-bugs

Sorry for the delay...

On Thu, Feb 6, 2014 at 5:05 PM, Heikki Linnakangas
<hlinnakangas(at)vmware(dot)com> wrote:
> On 02/06/2014 05:08 AM, jhedden(at)apple(dot)com wrote:
>>
>> The following bug has been logged on the website:
>>
>> Bug reference: 9118
>> Logged by: Joel Hedden
>> Email address: jhedden(at)apple(dot)com
>> PostgreSQL version: 9.3.2
>> Operating system: Mac OS X 10.9.1
>> Description:
>>
>> I connect a pg_receivexlog instance and have "hot_standby" archiving
>> enabled, with "archive_command" defined correctly. When the WAL Sender
>> process receives a SIGUSR2 from the postmaster (or me), it fails to shut
>> down and pg_receivexlog remains connected. Upon inspection, it looks like
>> the test for "sentPtr == MyWalSnd->flush" is always false at
>> walsender.c:1058 (sentPtr is still non-zero) where the wal sender should
>> be shutting down. Replication and archiving seem to be working otherwise.
>> Killing pg_receivexlog allows for the WAL Sender to terminate.
>
>
> Hmm. Before exiting, walsender waits until the client has flushed all the
> WAL to disk. However, pg_receivexlog never sends a "flush" pointer back to
> the server, so the server waits forever.
>
> The first question is, why does pg_receivexlog not send its "flush" pointer
> back to the server? It *does* fsync the files to disk. However, currently it
> only fsyncs when closing a full segment, but when shutting down, the last
> segment would not be full, so to fix this issue it should be taught to fsync
> also partial segments.

Yes. Also, pg_receivexlog always reports InvalidXLogRecPtr as its flush
location, so "sentPtr == MyWalSnd->flush" can never be true when the client
is pg_receivexlog. A quick fix would be to skip waiting for that condition
whenever the flush location is invalid.
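
To make the quick fix concrete, here is a rough, standalone sketch of the
check I have in mind. It is not a patch: walsender_may_exit() is a
hypothetical helper that does not exist in walsender.c, the typedef and macro
are stand-ins so that the snippet compiles on its own, and the other
conditions the real shutdown test looks at are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL definitions, so the sketch compiles alone. */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Hypothetical helper (not part of walsender.c).
 *
 * Decide whether a walsender that has sent WAL up to sent_ptr may exit
 * during shutdown, given the flush location last reported by the client.
 */
bool
walsender_may_exit(XLogRecPtr sent_ptr, XLogRecPtr client_flush_ptr)
{
    /*
     * A client such as pg_receivexlog reports an invalid flush location,
     * so waiting for it to match sent_ptr would never finish.
     */
    if (client_flush_ptr == InvalidXLogRecPtr)
        return true;

    /* Otherwise keep waiting until the client has flushed everything sent. */
    return sent_ptr == client_flush_ptr;
}
```

With this, a client that reports a valid flush location is still waited for,
while a client that reports InvalidXLogRecPtr no longer blocks the shutdown.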

> Fujii-san, how can walreceiver detect the closure of the connection, before
> reading all the buffered WAL from the TCP connection? What kind of log
> messages do you get when it happens?

I got the following messages.

[MASTER]
LOG: database system is shut down

[STANDBY]
FATAL: could not send data to WAL stream: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.

> I tried to reproduce that with commit
> bee4a4d361c054c531c3a27024f9ff3efef3635b reverted, but couldn't. Although
> this was with master and standby running on same laptop, and this is
> essentially a race condition, so it's possible that I just didn't get the
> timing right to make it happen.

You would need to enable WAL archiving. Every time I was able to
reproduce the problem, WAL archiving was enabled.

Regards,

--
Fujii Masao
