Re: BUG #18168: Parallel worker failed to initialize: could not create inherited socket: error code 10106

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: "Boyer, Maxime (he/him | il/lui)" <Maxime(dot)Boyer(at)cra-arc(dot)gc(dot)ca>
Cc: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, "pgsql-bugs(at)lists(dot)postgresql(dot)org" <pgsql-bugs(at)lists(dot)postgresql(dot)org>
Subject: Re: BUG #18168: Parallel worker failed to initialize: could not create inherited socket: error code 10106
Date: 2023-10-25 21:23:03
Message-ID: CA+hUKGJQrzNrXn1us_sYC9Djh9p7AQ1uPHWAjWhLzhX5YV-35w@mail.gmail.com
Lists: pgsql-bugs

On Thu, Oct 26, 2023 at 3:44 AM Boyer, Maxime (he/him | il/lui)
<Maxime(dot)Boyer(at)cra-arc(dot)gc(dot)ca> wrote:
> > FWIW, the PG code that throws that error message is old enough to vote;
> > it's not something we changed in a recent minor release.
>
> Yeah, that's what I thought :'D
>
> > I am guessing you saw the impact of some external event, but I don't know what.
>
> Fair enough. This happened the day after reverting to 11, because of the memory error on 14, but I also doubt it's related. I was stopping one of the application nodes at the time. Maybe a Windows thing, or something related to the firmware updates.

Re-bonjour Maxime,

FWIW that comes from WSASocket() trying to inherit/duplicate a socket
used for communication with the pgstat process (a process and a socket
that don't exist in PostgreSQL 15, where that mechanism was replaced
with a new shared memory system; but given you were trying to upgrade
to 14 you probably don't want to hear about 15 today...).

I have no idea why that would happen, but for the record the manual[1] says:

"WSAEPROVIDERFAILEDINIT
10106
Service provider failed to initialize. The requested service provider
could not be loaded or initialized. This error is returned if either a
service provider's DLL could not be loaded (LoadLibrary failed) or the
provider's WSPStartup or NSPStartup function failed."

That seems pretty low level. If this were PostgreSQL's fault I
suppose it would have to come from corruption of the WSAPROTOCOL_INFO
struct (a sort of cookie we need to duplicate the socket), but I doubt
it. I see there were a few reports years ago about this error message
from pre-parallel-query times. It's interesting that you see this
specifically with parallel workers (which inherit only a pgstat
socket), not with the client connection socket. The pgstat socket is
different in that it is a UDP socket. I wonder if there is something
special about UDP that is upsetting your network stack, perhaps a
firewall thing somewhere that is upset specifically by some limit on
UDP activity or something. But I'm not a Windows guy so I have no
real clue.
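
If it helps to narrow things down, here is a minimal diagnostic sketch in
Python (not PostgreSQL code). It assumes the interesting socket is a UDP
socket on 127.0.0.1, and the function names are made up for illustration.
The first check just sends a datagram to itself over loopback. The second
runs only on Windows, where Python's socket.share() and socket.fromshare()
exist; as far as I know those are built on the same duplicate-from-protocol-info
mechanism, so a failure there might surface as an OSError carrying the same
Winsock error code.

```python
import os
import socket


def udp_loopback_ok(timeout=1.0):
    """Send one datagram to ourselves over 127.0.0.1 and read it back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind(("127.0.0.1", 0))    # let the OS pick a free port
        s.connect(s.getsockname())  # "connect" the UDP socket to itself
        s.settimeout(timeout)
        s.send(b"x")
        return s.recv(16) == b"x"


def udp_duplicate_ok():
    """Windows only: duplicate a UDP socket via share()/fromshare().

    Returns None on platforms where these calls do not exist.
    """
    if not hasattr(socket, "fromshare"):
        return None
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind(("127.0.0.1", 0))
        info = s.share(os.getpid())  # serialized protocol info (bytes)
        with socket.fromshare(info) as dup:
            # The duplicate should refer to the same bound address.
            return dup.getsockname() == s.getsockname()


if __name__ == "__main__":
    print("UDP loopback round trip:", udp_loopback_ok())
    print("UDP socket duplication:", udp_duplicate_ok())
```

Running it under the same Windows account as the PostgreSQL service, ideally
while the problem is occurring, would at least show whether plain UDP
loopback and socket duplication work at all outside PostgreSQL. A clean run
does not rule out an intermittent provider failure.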

[1] https://learn.microsoft.com/en-us/windows/win32/winsock/windows-sockets-error-codes-2
