Windows socket problems, interesting connection to AIO

From: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Windows socket problems, interesting connection to AIO
Date: 2024-09-02 09:20:21
Message-ID: CA+hUKGLR10ZqRCvdoRrkQusq75wF5=vEetRSs2_u1s+FAUosFQ@mail.gmail.com

There's a category[1] of random build farm/CI failures where Windows
behaves differently and our stuff breaks, which also affects end
users. A recent BF failure[2] that looks like one of those jangled my
nerves when I pushed a commit, so I looked into a new theory on how to
fix it. First, let me restate my understanding of the two categories
of known message loss on Windows, since the information is scattered
far and wide across many threads:

1. When a backend exits without closing the socket gracefully
(behaviour that was briefly fixed[3] but later reverted because it
broke something else), a Windows server's network stack might fail
to send data that it had buffered but not yet physically sent[4].

The reason we reverted that and went back to abortive socket
shutdown (i.e. just exit()) is that our WL_SOCKET_READABLE handling
was buggy, and could miss FD_CLOSE events from graceful shutdowns,
though not from abortive ones (which keep reporting themselves
repeatedly, apparently something to do with being an error state).
Sometimes a libpq socket we were waiting on with WaitLatchOrSocket()
at the client end could hang forever. Concretely: a replication
connection, or postgres_fdw running inside another PostgreSQL
server. We fixed that event loss, albeit in a gross, kludgy way[5],
because other ideas seemed too complicated (to wit, various ways to
manage extra state associated with each socket, which would be
really hard to retrofit in a satisfying way). Graceful shutdown
should fix the race cases where the next thing the client calls is
recv(), as far as I know.
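
To make the two shutdown modes concrete, the difference on the
server side looks roughly like this (an illustrative sketch, not our
actual code; SD_SEND is the Winsock spelling of POSIX SHUT_WR):

#include <winsock2.h>

/* Assumes Winsock is initialized and "sock" is a connected SOCKET. */
static void
close_gracefully(SOCKET sock)
{
    char    drain[512];

    /* Declare that we're done sending; the stack keeps trying to
     * deliver anything still buffered. */
    shutdown(sock, SD_SEND);

    /* Optionally drain until the peer closes its end. */
    while (recv(sock, drain, sizeof(drain), 0) > 0)
        ;

    closesocket(sock);
}

/* The abortive alternative is just exit(), or closesocket() after
 * setting SO_LINGER with l_linger = 0, in which case buffered but
 * unsent data may be thrown away. */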

2. If a Windows client tries to send() and gets an ECONNRESET/EPIPE
error, the network stack seems to drop data it has already received,
so a following recv() will never see it. In other words, exposure
depends on whether the application-level protocol is strictly
request/response based, or has sequence points at which both ends
might send(). AFAIK the main consequence for real users is that
FATAL messages (recovery conflict, idle timeout, etc.) are not
delivered to clients, leaving just "server closed the connection
unexpectedly".

I have wondered whether it might help to kludgify the Windows TCP code
even more by doing an extra poll() for POLLRD before every single
send(). "Hey network stack, before I try to send this message, is
there anything the server wanted to tell me?", but I guess that must
be racy because the goodbye message could arrive between poll() and
send(). Annoyingly, I suspect it would *mostly* work.
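
For what it's worth, that kludge would look something like the
sketch below (hypothetical code, using WSAPoll() and POLLRDNORM for
the "poll() for POLLRD" above); the comment marks the race window:

#include <winsock2.h>

static int
send_with_peek(SOCKET sock, const char *data, int len)
{
    WSAPOLLFD   pfd;

    pfd.fd = sock;
    pfd.events = POLLRDNORM;
    pfd.revents = 0;

    /* "Hey network stack, is there anything the server wanted to
     * tell me?" */
    if (WSAPoll(&pfd, 1, 0) > 0 &&
        (pfd.revents & (POLLRDNORM | POLLHUP)) != 0)
    {
        /* Drain and report the server's goodbye message first. */
    }

    /* Race: the goodbye message could still arrive right here, after
     * the poll but before the send, so this would only *mostly* work. */
    return send(sock, data, len, 0);
}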

The new thought I had about the second category of problem is: if you
use asynchronous networking APIs, then the kernel *can't* throw your
data out, because it doesn't even have it. If the server's FATAL
message arrives before the client calls send(), then the data is
already written to user space memory and the I/O is marked as
complete. If it arrives after, then there's no issue, because
computers can't see into the future yet. That's my hypothesis,
anyway. To try that, I started with a very simple program[6] on my
local FreeBSD system that does a failing send, and tries synchronous
and asynchronous recv():

=== synchronous ===
send -> -1, error = 32
recv -> "FATAL: flux capacitor failed", error = 0
=== posix aio ===
send -> -1, error = 32
async recv -> "FATAL: flux capacitor failed", error = 0

... and then googled enough Windows-fu to translate it and run it on
CI, and saw the known category 2 failure with the plain old
synchronous version. The good news is that the async version sees the
goodbye message:

=== synchronous ===
send -> 14, error = 0
recv -> "", error = 10054
=== windows overlapped ===
send -> 14, error = 0
async recv -> "FATAL: flux capacitor failed", error = 0

That's not the same as a torture test for weird timings, and I have
zero knowledge of the implementation of this stuff, but I currently
can't imagine how it could possibly be implemented in any way that
could give a different answer.
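
For reference, the overlapped version of the recv() amounts to
something like this (a sketch in the spirit of the test program
above, not a copy of it; assumes Winsock is initialized and "sock"
is connected):

#include <winsock2.h>
#include <stdio.h>

static void
overlapped_recv_demo(SOCKET sock)
{
    char            buf[256];
    WSABUF          wbuf = {sizeof(buf), buf};
    WSAOVERLAPPED   ov = {0};
    DWORD           received = 0;
    DWORD           flags = 0;

    /* Post the receive before the send() that provokes the reset, so
     * incoming data is written straight into buf as it arrives. */
    ov.hEvent = WSACreateEvent();
    if (WSARecv(sock, &wbuf, 1, &received, &flags, &ov, NULL) == SOCKET_ERROR &&
        WSAGetLastError() != WSA_IO_PENDING)
    {
        printf("WSARecv failed: %d\n", WSAGetLastError());
        return;
    }

    /* ... the failing send() would go here ... */

    /* Wait for completion; a FATAL message that arrived before the
     * reset should already be sitting in buf. */
    WSAWaitForMultipleEvents(1, &ov.hEvent, TRUE, WSA_INFINITE, FALSE);
    if (WSAGetOverlappedResult(sock, &ov, &received, FALSE, &flags))
        printf("async recv -> \"%.*s\", error = 0\n", (int) received, buf);
    else
        printf("async recv -> error = %d\n", WSAGetLastError());

    WSACloseEvent(ov.hEvent);
}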

Perhaps we could figure out a way to simulate synchronous recv() on
top of that API, but I think a more satisfying use of our time and
energy would be to redesign all our networking code to do
cross-platform AIO. I think that will mostly come down to a bunch
of network buffer management redesign work. Anyway, I don't have
anything concrete there; I just wanted to share this observation.

[1] https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#Miscellaneous_tests_fail_on_Windows_due_to_a_connection_closed_before_receiving_a_final_error_message
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-08-31%2007%3A54%3A58
[3] https://github.com/postgres/postgres/commit/6051857fc953a62db318329c4ceec5f9668fd42a
[4] https://learn.microsoft.com/en-us/windows/win32/winsock/graceful-shutdown-linger-options-and-socket-closure-2
[5] https://github.com/postgres/postgres/commit/a8458f508a7a441242e148f008293128676df003
[6] https://github.com/macdice/hello-windows/blob/socket-hacking/test.c
