Re: postgres_fdw test timeouts

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: Alexander Lakhin <exclusion(at)gmail(dot)com>
Cc: Nathan Bossart <nathandbossart(at)gmail(dot)com>, pgsql-hackers(at)postgresql(dot)org
Subject: Re: postgres_fdw test timeouts
Date: 2023-12-07 20:55:58
Message-ID: CA+hUKGLKZhs8Y-r1MxTDdiDvtbgaAGnCL=br5i4LDKGYd+e9YA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Dec 7, 2023 at 10:00 PM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
> I think, I've found out what's going on here.
> The culprit is WSAEnumNetworkEvents() assisted by non-trivial logic of
> ExecAppendAsyncEventWait().
> For the case noccurred > 1, ExecAppendAsyncEventWait() performs a loop,
> where ExecAsyncNotify() is called for the first AsyncRequest, but the
> second one also processed inside, through a recursive call to
> ExecAppendAsyncEventWait():
> -> ExecAsyncNotify -> produce_tuple_asynchronously
> -> ExecScan -> ExecInterpExpr -> ExecSetParamPlan -> ExecProcNodeFirst
> -> ExecAgg -> agg_retrieve_direct -> ExecProcNodeInstr -> ExecAppend
> -> ExecAppendAsyncEventWait.
> Here we get into the first loop and call ExecAsyncConfigureWait() for the
> second AsyncRequest (because we haven't reset it's callback_pending yet),
> and it leads to creating another WaitEvent for the second socket inside
> postgresForeignAsyncConfigureWait():
> AddWaitEventToSet(set, WL_SOCKET_READABLE, PQsocket(fsstate->conn), ...

Oh, wow. Nice detective work! Thank you for figuring that out.

> So it looks like we have the same issue with multiple event handles
> associated with a single socket here.
> And v2-0003-Redesign-Windows-socket-event-management.patch from [1]
> "surprisingly" helps in this case as well (I could not see a failure for
> 100 iterations of 10 tests in parallel).

Yeah, this makes perfect sense.

So, commit 04a09ee is not guilty. But as the saying goes, "no good
deed goes unpunished", and work on our Windows port seems to be
especially prone to backfiring when kludges combine...

Now we have the question of whether to go forwards (commit the "socket
table" thing), or backwards (revert 04a09ee for now to clear the CI
failures). I don't love the hidden complexity of the socket table and
am not in a hurry to commit it, but I don't currently see another
way... on the other hand we have other CI flapping due to that problem
too so reverting 04a09ee would be sweeping problems under the carpet.
I still need to process your feedback/discoveries on that other thread
and it may take a few weeks for me to get to it.

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Bruce Momjian 2023-12-07 21:12:18 Re: [CAUTION!! freemail] Re: Partial aggregates pushdown
Previous Message Dave Cramer 2023-12-07 20:46:37 Re: errors building on windows using meson