Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Noah Misch <noah(at)leadboat(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only
Date: 2023-10-09 03:22:52
Message-ID: CA+hUKGKyc_K0bF-j1ds0yqxhiO9O5=qaJY5N3Km2BLMci1omfg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Oct 9, 2023 at 3:25 PM Noah Misch <noah(at)leadboat(dot)com> wrote:
> The "fd >= FD_SETSIZE" check is irrelevant on Windows. See comments in the
> attached patch; in brief, Windows assigns FDs and uses FD_SETSIZE differently.
> The first associated failure was commit dea12a1 (2023-08-03); as a doc commit,
> it's an innocent victim. Bisect blamed 8488bab "ci: Use windows VMs instead
> of windows containers" (2023-02), long before the failures began. I'll guess
> some 2023-08 Windows update or reconfiguration altered file descriptor
> assignment, hence the onset of failures. In my tests of v16, the highest file
> descriptor was 948. I could make v16 fail by changing --client=5 to
> --client=90 in the test. With the attached patch and --client=90, v16 peaked
> at file descriptor 2040.

Ohhh. Thanks for figuring that out. I'd never noticed that quirk. I
didn't/can't test it but the patch looks reasonable after reading the
referenced docs. Maybe instead of just "out of range" I'd say "out of
range for select()" since otherwise it might seem a little baffling --
what range?

Random water cooler speculation about future ideas: I wonder
whether/when we can eventually ditch select() and use WSAPoll() for
this on Windows, which is supposed to be like poll(). There's a
comment explaining that we prefer select() because it has a higher
resolution sleep than poll() (us vs ms), so we wouldn't want to use
poll() on Unixen, but we know that Windows doesn't even use high
resolution timers for any user space APIs anyway so that's just not a
concern on that platform. The usual reason nobody ever uses WSAPoll()
is because the Curl guys told[1] everyone that it's terrible in a way
that would quite specifically break our usage. But I wonder, because
the documentation now says "As of Windows 10 version 2004, when a TCP
socket fails to connect, (POLLHUP | POLLERR | POLLWRNORM) is
indicated", it *sounds* like it might have been fixed ~3 years ago?
One idea would be to hide it inside WaitEventSet, and let WaitEventSet
be used in front end code (we couldn't use the
WaitForMultipleObjects() version because it's hard-limited to 64
events, but in front end code we don't need latches and other stuff,
so we could have a sockets-only version with WSAPoll()). The idea of
using WaitEventSet for pgbench on Unix was already mentioned a couple
of times by others for general scalability reasons -- that way we
could finish up using a better kernel interface on all supported
platforms. We'd probably want to add high resolution time-out support
(I already posted a patch for that somewhere), or we'd be back to 1ms
timeouts.

[1] https://daniel.haxx.se/blog/2012/10/10/wsapoll-is-broken/

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Noah Misch 2023-10-09 04:08:46 Re: REL_15_STABLE: pgbench tests randomly failing on CI, Windows only
Previous Message David G. Johnston 2023-10-09 02:58:15 Re: Fix output of zero privileges in psql