From: | Nikhil Sontakke <nikhil(dot)sontakke(at)enterprisedb(dot)com> |
---|---|
To: | Magnus Hagander <magnus(at)hagander(dot)net> |
Cc: | Alvaro Herrera <alvherre(at)commandprompt(dot)com>, Luke Koops <luke(dot)koops(at)entrust(dot)com>, pgsql-bugs(at)postgresql(dot)org |
Subject: | Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram |
Date: | 2009-08-03 14:12:48 |
Message-ID: | a301bfd90908030712h3668b135g16cce3a12c332a93@mail.gmail.com |
Lists: | pgsql-bugs |
Hi,
>>>>
>>>>> ntdll.dll!NtWaitForMultipleObjects+0xc
>>>>> kernel32.dll!WaitForMultipleObjectsEx+0x11a
>>>>> postgres.exe!pgwin32_waitforsinglesocket+0x1ed
>>>>> postgres.exe!pgwin32_recv+0x90
>>>>> postgres.exe!PgstatCollectorMain+0x17f
>>>>> postgres.exe!SubPostmasterMain+0x33a
>>>>> postgres.exe!main+0x168
>>>>> postgres.exe!__tmainCRTStartup+0x10f
>>>>> kernel32.dll!BaseProcessStart+0x23
>>>>
>>>> I have seen this problem too. The process seems stuck for no good
>>>> reason. I wondered at the time if it could be a kernel issue. I
>>>> remember trying to send some data to the collector to verify whether
>>>> it'd wake up, but no luck. (I mean I couldn't find a way to do it on
>>>> Windows).
>>>
>>> I have seen this as well, but only in cases where there has been
>>> broken firewall software or such things involved. I have seen a couple
>>> of reports from the field though.
>>>
>>> Anyway, this really is a should-never-happen thing. As soon as a new
>>> packet is sent in, WaitForMultipleObjectsEx() should return right
>>> away. And given that backends regularly send packets over, it
>>> shouldn't be an issue even if we miss one...
>>>
>>
>> And this fact should lend credence to Alvaro's (as well as my)
>> suspicion that this is a Windows kernel issue.
>>
>> As a consequence, Magnus, I was wondering whether the READ case would
>> also benefit from what the WRITE handling already does inside
>> pgwin32_waitforsinglesocket(): waiting with a fixed timeout in a loop
>> rather than making a single INFINITE call to WaitForMultipleObjectsEx().
>> I believe Teodor Sigaev raised a similar concern a while back:
>>
>> http://www.nabble.com/-GENERAL--Stats-collector-frozen--td8569977i20.html
>
> Maybe. I'm unsure if it's enough to just try another
> WaitForSingleObjectEx() on it, or if we need to actually issue a
> WSARecv() on it as well. Maybe it would be enough to just change the
> INFINITE on line 318 of socket.c to a fixed value. That will loop down
> to WSARecv() which should exit with WSAEWOULDBLOCK which will cause us
> to do a short sleep and come back. But we'd have to change the limit
> of 5 somehow then, since in theory we should wait forever. Maybe that
> outer loop should just be a for(;;), what do you think?
>
Yes, line 318 seems like a much better location to me. If Windows and
its socket logic behave properly most of the time :), most calls should
return within the first few tries, so changing the limit of 5 to an
infinite loop shouldn't hurt the normal cases, in theory.
OTOH, what if we kill the stats collector and socket communication
resumes properly after the restart? Would that conclusively mean that
the flaw is in our code?
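
To make the bounded-wait idea above concrete, here is a rough sketch. It is not the socket.c code: it uses POSIX poll() and recv() as stand-ins for WaitForMultipleObjectsEx() and WSARecv(), and the names recv_with_bounded_wait, make_nonblocking_pair and WAIT_SLICE_MS are made up for illustration. The point is only the shape of the loop: wait for a fixed slice instead of forever, retry the receive after every wait (even one that timed out), and optionally loop without a retry limit.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical slice length; the thread only says "a fixed value". */
#define WAIT_SLICE_MS 100

/*
 * Receive from a non-blocking socket, waiting in bounded slices.
 * max_slices < 0 means "no limit", i.e. the for(;;) variant.
 * Returns the number of bytes received, or -1 with errno set
 * (EAGAIN if all slices elapsed without any data arriving).
 */
ssize_t recv_with_bounded_wait(int sock, void *buf, size_t len, int max_slices)
{
    for (int slice = 0;; slice++)
    {
        /* Retry the receive after every wait, even one that only timed
         * out, so a missed wakeup cannot strand us forever. */
        ssize_t n = recv(sock, buf, len, 0);
        if (n >= 0)
            return n;
        if (errno != EAGAIN && errno != EWOULDBLOCK && errno != EINTR)
            return -1;          /* hard error */

        if (max_slices >= 0 && slice >= max_slices)
            break;              /* retry budget used up */

        /* Bounded wait instead of an infinite one. */
        struct pollfd pfd = { .fd = sock, .events = POLLIN, .revents = 0 };
        if (poll(&pfd, 1, WAIT_SLICE_MS) < 0 && errno != EINTR)
            return -1;
    }
    errno = EAGAIN;
    return -1;
}

/* Illustration helper: a connected pair of non-blocking datagram sockets. */
int make_nonblocking_pair(int fds[2])
{
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, fds) < 0)
        return -1;
    for (int i = 0; i < 2; i++)
    {
        int flags = fcntl(fds[i], F_GETFL, 0);
        if (flags < 0 || fcntl(fds[i], F_SETFL, flags | O_NONBLOCK) < 0)
            return -1;
    }
    return 0;
}
```

With max_slices set to -1 the caller still waits forever in effect, but each wait is bounded, so a wakeup that never gets delivered costs at most one slice rather than hanging the process.
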
Regards,
Nikhils
> From what I understand, none of you have an environment where you can
> reliably reproduce this? That means it's going to be a PITA to try to
> figure out if we're actually fixing anything :S
>
>
> --
> Magnus Hagander
> Self: http://www.hagander.net/
> Work: http://www.redpill-linpro.com/
>