Re: PostgreSQL 6.2.5 Visual Studio Build does not pass the regression tests.

From: Noah Misch <noah(at)leadboat(dot)com>
To: "Shiv Shivaraju Gowda (shivshi)" <shivshi(at)cisco(dot)com>
Cc: "pgsql-bugs(at)postgresql(dot)org" <pgsql-bugs(at)postgresql(dot)org>, alvherre(at)2ndquadrant(dot)com
Subject: Re: PostgreSQL 6.2.5 Visual Studio Build does not pass the regression tests.
Date: 2014-01-24 04:27:42
Message-ID: 20140124042742.GA1975540@tornado.leadboat.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-bugs

On Fri, Nov 22, 2013 at 01:59:58AM +0000, Shiv Shivaraju Gowda (shivshi) wrote:
> There seems to be a has a dormant bug in PostgreSQL source code build using VisualStudio(VS) which shows up with the newer OS( not sure if it is the OS or some other thing in the environment ).

Thanks for this report, and apologies for the delay.

Please tell us as much as possible about the OS configurations that have
exhibited the problem and the OS configurations that have *not* exhibited the
problem with the same binaries.

> MinGW build doesn't access this code path and thus doesn't hit this bug.

A MinGW build certainly includes win32_sema.c, and I don't see why it wouldn't
use that code path. How did you determine that?

> SYMPTOM:
> The following symptom is encountered in PostgreSQL build using Visual Studio: PostgreSQL regression tests fail with server crashing repeatedly with this message in the log file: "PANIC: could not lock semaphore: error code 0". The issue is encountered for VS 2005, VS 2008( 32bit and 64bit executables), VS 2010 and VS 2012 built executables. The issue was reproducible with PostgreSQL 6.2.3 and 6.2.5. We didn't encounter this issue in MinGW build or EnterpriseDB Packaged executables (which seems to have been built using VisualStudio 2010).

I assume you mean 9.2.3 and 9.2.5?

> CAUSE:
> The PGSemaphoreLock function in postgresql-9.2.5\src\backend\port\win32_sema.c (https://github.com/postgres/postgres/blob/master/src/backend/port/win32_sema.c) uses "Ex" version of "WaitForMultipleObjects" Windows function (http://msdn.microsoft.com/en-us/library/windows/desktop/ms687028%28v=vs.85%29.aspx) however doesn't handle the additional awake calls from the "bAlertable" state. Specifically, it doesn't handle the WAIT_IO_COMPLETION return code when woken up by a User-mode Asynchronous Procedure Call(APC) or Async IO completion.
>
> The part I do NOT understand is why do the User-Mode APC or Async IO completion triggers get fired only in the executables we built and not in the ones built by EnterpriseDB and bits built by other users since no one has complained about it. Irrespective if it is triggered or not, the code should have handled all the return codes of the WaitForMultipleObjectsEX API and that is the reason I think it is a bug.
>
> I checked the source code for calls which will trigger the Async IO or user-mode APC(ReadFileEx<http://msdn.microsoft.com/en-us/library/windows/desktop/aa365468(v=vs.85).aspx>, WriteFileEx<http://msdn.microsoft.com/en-us/library/windows/desktop/aa365748(v=vs.85).aspx>, QueueUserAPC<http://msdn.microsoft.com/en-us/library/windows/desktop/ms684954(v=vs.85).aspx>) and could not find any. I am not sure what triggers the WAIT_IO_COMPLETION return call. I could not find a way to figure out that information in the debug environment.

Yes, that's weird. I could believe anti-malware software doing this. If any
successful test has taken place on the same machine and user account as the
unsuccessful test, that explanation is unlikely. It's also possible that
you're building against an unusual version of a 3rd-party library that, unlike
versions used in popular builds, does use APCs or async I/O.

All that being said, it's clearly wrong to call WaitForMultipleObjectsEx(...,
TRUE) and then treat WAIT_IO_COMPLETION like WAIT_FAILED.

> (Posible) FIX:
> Either of the following changes to the PGSemaphoreLock function work fine and pass the regression tests.
>
> 1) Replace the call to WaitForMultipleObjectsEx with WaitForMultipleObjects.

I don't know how we decided which waits to make alertable, so I lean against
changing this wait from alertable to normal.

> 2) Handle the WAIT_IO_COMPLETION return code same as WAIT_OBJECT_0. There is a similar code like this in socket.c, so this change should be safe too.

Sounds good. I looked through the other WaitFor*Object* callers for similar
problems. pgwin32_select() also mishandles WAIT_IO_COMPLETION, returning zero
as though the full timeout had elapsed. It should return -1 and set EINTR.
This could affect CheckRADIUSAuth() and PgstatCollectorMain(), both of which
distinguish full timeout expiration from EINTR.

Would you verify that the attached patch fixes your builds?

nm

--
Noah Misch
EnterpriseDB http://www.enterprisedb.com

Attachment Content-Type Size
w32-io-completion-v1.patch text/plain 3.1 KB

In response to

Responses

Browse pgsql-bugs by date

  From Date Subject
Next Message Andrew Dunstan 2014-01-24 04:32:25 Re: BUG #8218: Error when querying an JSON data, 9.3beta
Previous Message Bruce Momjian 2014-01-24 03:05:35 Re: BUG #8218: Error when querying an JSON data, 9.3beta