Re: Some 9.5beta2 backend processes not terminating properly?

From:	Andres Freund <andres(at)anarazel(dot)de>
To:	Amit Kapila <amit(dot)kapila16(at)gmail(dot)com>
Cc:	Petr Jelinek <petr(at)2ndquadrant(dot)com>, Shay Rojansky <roji(at)roji(dot)org>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject:	Re: Some 9.5beta2 backend processes not terminating properly?
Date:	2016-01-02 15:20:58
Message-ID:	20160102152058.g4vrlarute2vrmgz@alap3.anarazel.de
Lists:	pgsql-hackers

On 2016-01-02 15:40:03 +0100, Andres Freund wrote:
> I wonder if the following is the problem: The docs for WSAEventSelect()
> says:
> "Having successfully recorded the occurrence of the network event (by
> setting the corresponding bit in the internal network event record) and
> signaled the associated event object, no further actions are taken for
> that network event until the application makes the function call that
> implicitly reenables the setting of that network event and signaling of
> the associated event object."
> and also notes specifically for FD_CLOSE that there's no re-enabling
> functions.
>
> See
> https://msdn.microsoft.com/en-us/library/windows/desktop/ms741576%28v=vs.85%29.aspx
> which goes on to talk about some level triggered events (FD_READ, ...)
> and others being edge triggered. It's not clear to me from that whether
> FD_CLOSE is supposed to be edge or level triggered.
>
> If FD_CLOSE is indeed edge and not level triggered - which imo would be
> supremely insane - we'd be in trouble. It'd explain why some failures
> are noticed and others not.

I found a few more resources confirming that FD_CLOSE is edge
triggered. That probably doesn't just make our code buggy when waiting
twice on the same socket, it probably also makes it very timing
dependent: as the event is only triggered when the close actually
occurs, it's possible that we don't have any event associated with the
socket at that moment. We only do so for short amounts of time, in
WaitLatchOrSocket() and pgwin32_waitforsinglesocket().
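
To make the timing argument concrete, here is a toy model in C of the
behaviour described above. It is only a sketch: ToySocket and the toy_*
functions are hypothetical names invented for illustration, and the model
encodes the assumption made in this mail (the event fires only if an event
object is attached at the moment of the close). It is not Winsock's actual
implementation.

```c
#include <stdbool.h>

/* Toy model only; all names here are hypothetical. */
typedef struct ToySocket
{
    bool peer_closed;       /* level state: stays true once the peer closed */
    bool event_registered;  /* is an event object attached right now? */
    bool event_signaled;    /* edge state: set only at the moment of close */
} ToySocket;

/* The peer closes the connection; the edge fires only if someone listens. */
static void
toy_peer_close(ToySocket *s)
{
    s->peer_closed = true;
    if (s->event_registered)
        s->event_signaled = true;
}

/*
 * Edge-style wait: attach a fresh event, optionally let the close happen
 * during the wait, report whether the event fired, then detach again.
 */
static bool
toy_wait_sees_close(ToySocket *s, bool peer_closes_during_wait)
{
    bool seen;

    s->event_registered = true;
    s->event_signaled = false;
    if (peer_closes_during_wait)
        toy_peer_close(s);
    seen = s->event_signaled;
    s->event_registered = false;
    return seen;
}

/* Level-style check: looks at the current state, not at a transition. */
static bool
toy_level_check_sees_close(const ToySocket *s)
{
    return s->peer_closed;
}
```

In this model a close that happens during a wait is seen once, a second
wait on the same socket no longer sees it, and a close that happens between
two waits is never seen by the edge-style wait at all. The level-style
check reports it in every case.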

A bit of searching around brought up that we saw issues around this
before:
http://www.postgresql.org/message-id/4351.1336927207@sss.pgh.pa.us

Right now I can really only see two somewhat surgical fixes:

1) We do a nonblocking or select() *after* registering our events. Both
in WaitLatchOrSocket() and pgwin32_waitforsinglesocket(). Since select/poll are
explicitly level triggered, that should make us notice any events we
might have missed. select() appears to have been available for a fair
while.

2) We explicitly shutdown(SD_BOTH) the socket whenever we get an FD_CLOSE
event. I *think* this should trigger errors in WSARecv, WSAEventSelect
et al. That doesn't solve the problem that we might miss important
events, though.
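
The idea behind 1) can be sketched with plain POSIX sockets. This is an
illustration under stated assumptions, not the proposed patch: it uses
socketpair() and POSIX select() as a stand-in for Winsock, and
socket_readable_now() and close_visible_after_the_fact() are hypothetical
helper names. It shows that a zero-timeout select() still reports a close
that happened before anybody was waiting, and keeps reporting it.

```c
#include <sys/types.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the socket is readable right now
 * (which includes "peer has closed"), 0 if not, -1 on error.
 * The zero timeout makes select() a pure poll that never blocks.
 */
static int
socket_readable_now(int fd)
{
    fd_set          rfds;
    struct timeval  tv = {0, 0};
    int             rc;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    rc = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (rc < 0)
        return -1;
    return FD_ISSET(fd, &rfds) ? 1 : 0;
}

/*
 * Hypothetical demo: the peer closes while nobody is waiting, and the
 * close is still visible to later level-triggered checks.
 * Returns 1 if that holds, 0 if not, -1 if the socket pair can't be made.
 */
static int
close_visible_after_the_fact(void)
{
    int     sv[2];
    char    buf[1];
    int     before, first, second;
    ssize_t n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;

    before = socket_readable_now(sv[0]);    /* nothing pending yet */
    close(sv[1]);                           /* peer goes away unobserved */
    first = socket_readable_now(sv[0]);     /* close is reported ... */
    second = socket_readable_now(sv[0]);    /* ... and reported again */
    n = recv(sv[0], buf, sizeof(buf), 0);   /* 0 bytes read means EOF */
    close(sv[0]);

    return before == 0 && first == 1 && second == 1 && n == 0;
}
```

In the Windows code the equivalent check would run right after the event
registration in WaitLatchOrSocket() and pgwin32_waitforsinglesocket(), so
that a close missed by the edge-triggered event is still caught before
blocking.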

Given that 2) isn't a complete fix, and that I can't find reliable
documentation on since when shutdown() has been supported, I'm inclined
to go with 1).

Better ideas?

Greetings,

Andres Freund

In response to

Re: Some 9.5beta2 backend processes not terminating properly? at 2016-01-02 14:40:03 from Andres Freund

Responses

Re: Some 9.5beta2 backend processes not terminating properly? at 2016-01-02 16:14:40 from Andres Freund
Re: Some 9.5beta2 backend processes not terminating properly? at 2016-01-02 17:28:10 from Tom Lane
Re: Some 9.5beta2 backend processes not terminating properly? at 2016-01-02 20:11:42 from Tom Lane
