From: | Magnus Hagander <magnus(at)hagander(dot)net> |
---|---|
To: | Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com> |
Cc: | Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Luke Koops <luke(dot)koops(at)entrust(dot)com>, pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram |
Date: | 2009-08-03 14:16:49 |
Message-ID: | 9837222c0908030716r416d789ewef3f77c65b39d916@mail.gmail.com |
Lists: | pgsql-bugs |
On Mon, Aug 3, 2009 at 16:12, Nikhil Sontakke
<nikhil(dot)sontakke(at)enterprisedb(dot)com> wrote:
> Hi,
>
>>>>>
>>>>>> ntdll.dll!NtWaitForMultipleObjects+0xc
>>>>>> kernel32.dll!WaitForMultipleObjectsEx+0x11a
>>>>>> postgres.exe!pgwin32_waitforsinglesocket+0x1ed
>>>>>> postgres.exe!pgwin32_recv+0x90
>>>>>> postgres.exe!PgstatCollectorMain+0x17f
>>>>>> postgres.exe!SubPostmasterMain+0x33a
>>>>>> postgres.exe!main+0x168
>>>>>> postgres.exe!__tmainCRTStartup+0x10f
>>>>>> kernel32.dll!BaseProcessStart+0x23
>>>>>
>>>>> I have seen this problem too. The process seems stuck for no good
>>>>> reason. I wondered at the time if it could be a kernel issue. I
>>>>> remember trying to send some data to the collector to verify whether
>>>>> it'd wake up, but no luck. (I mean I couldn't find a way to do it on
>>>>> Windows).
>>>>
>>>> I have seen this as well, but only in cases where there has been
>>>> broken firewall software or such things involved. I have seen a couple
>>>> of reports from the field though.
>>>>
>>>> Anyway, this really is a should-never-happen thing. As soon as a new
>>>> packet is sent in, WaitForMultipleObjectsEx() should return right
>>>> away. And given that backends regularly send packets over, it
>>>> shouldn't be an issue even if we miss one...
>>>>
>>>
>>> And this fact should lend credence to Alvaro's (as well as my)
>>> suspicion that it is a Windows kernel issue.
>>>
>>> As a consequence, Magnus, I was wondering whether a loop inside
>>> pgwin32_waitforsinglesocket() that waits for a fixed timeout (rather
>>> than making an INFINITE call to WaitForMultipleObjectsEx), similar to
>>> the existing WRITE handling, would help for the READ case too? I
>>> believe Teodor Sigaev had raised a similar concern a while back:
>>>
>>> http://www.nabble.com/-GENERAL--Stats-collector-frozen--td8569977i20.html
>>
>> Maybe. I'm unsure if it's enough to just try another
>> WaitForSingleObjectEx() on it, or if we need to actually issue a
>> WSARecv() on it as well. Maybe it would be enough to just change the
>> INFINITE on line 318 of socket.c to a fixed value. That will loop down
>> to WSARecv() which should exit with WSAEWOULDBLOCK which will cause us
>> to do a short sleep and come back. But we'd have to change the limit
>> of 5 somehow then, since in theory we should wait forever. Maybe that
>> outer loop should just be a for(;;), what do you think?
>>
>
> Yes, line 318 seems to be a much better location to me. If Windows and
> its socket logic behave properly most of the time :), most of the
> calls should come out within the first few tries, so changing 5 to an
> infinite loop shouldn't hurt those normal use cases in theory.
>
> OTOH, I was wondering what if we kill the stats collector and on a
> restart the socket communication resumes properly. Would that
> conclusively mean that it is a flaw in our code?

No. Killing the stats collector destroys all of its sockets, and the
new collector starts with sockets that are fresh and new. So a clean
restart doesn't show that the flaw is in our code, but it also doesn't
show that it's in the kernel or the runtime libraries.
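
For illustration only, the bounded-wait idea quoted above (a fixed timeout
instead of INFINITE, with the outer loop turned into a for(;;)) could look
roughly like the sketch below. It is a POSIX stand-in, not the code in
socket.c: poll() plays the role of WaitForMultipleObjectsEx(), a
non-blocking recv() plays the role of WSARecv(), and EWOULDBLOCK plays the
role of WSAEWOULDBLOCK. The function name recv_with_bounded_wait and the
100 ms value are hypothetical; the thread does not settle on a timeout.

```c
#include <errno.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical slice length; the thread only says "a fixed value". */
#define RECV_WAIT_TIMEOUT_MS 100

/*
 * Receive one datagram without ever blocking indefinitely in a single
 * wait call. Each pass tries a non-blocking recv(); if nothing is
 * there, it waits for at most RECV_WAIT_TIMEOUT_MS and tries again.
 * The loop itself has no retry limit, so the caller still waits
 * "forever" overall, but a missed wakeup costs at most one slice.
 */
ssize_t
recv_with_bounded_wait(int sock, void *buf, size_t len)
{
    for (;;)
    {
        ssize_t n = recv(sock, buf, len, MSG_DONTWAIT);

        if (n >= 0)
            return n;           /* got a datagram (possibly empty) */

        if (errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
            return -1;          /* real error, report it to the caller */

        struct pollfd pfd = { .fd = sock, .events = POLLIN };

        /*
         * Bounded wait. The result is deliberately ignored: whether
         * poll() reports readiness, times out, or is interrupted, the
         * next step is the same, namely retrying recv().
         */
        (void) poll(&pfd, 1, RECV_WAIT_TIMEOUT_MS);
    }
}
```

The point of the structure is that a lost readiness notification no longer
hangs the reader; the next recv() attempt after the timeout picks up any
datagram that is already queued.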
--
Magnus Hagander
Self: http://www.hagander.net/
Work: http://www.redpill-linpro.com/