WaitEventSetWaitBlock() can still hang on Windows due to connection reset

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: WaitEventSetWaitBlock() can still hang on Windows due to connection reset
Date: 2024-10-22 10:00:00
Message-ID: 93515d62-edce-9041-ec6e-7122f6e92bea@gmail.com
Lists: pgsql-hackers

Hello hackers,

[ a follow-up to [1], split out to focus on this issue ]

Let me bring your attention to a failure [2], produced by drongo last
week, with the following diagnostics:
[10:30:30.954](2.110s) ok 5 - compare primary and standby dumps
### Restarting node "primary"
# Running: pg_ctl -w -D
C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/recovery/027_stream_regress\\data/t_027_stream_regress_primary_data/pgdata
-l C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/recovery/027_stream_regress\\log/027_stream_regress_primary.log restart
waiting for server to shut down.... done
server stopped
waiting for server to start.... done
server started
# Postmaster PID for node "primary" is 8908
Waiting for replication conn standby_1's replay_lsn to pass 0/158C8B98 on primary
[10:41:32.115](661.161s) # poll_query_until timed out executing this query:
# SELECT '0/158C8B98' <= replay_lsn AND state = 'streaming'
#          FROM pg_catalog.pg_stat_replication
#          WHERE application_name IN ('standby_1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#

027_stream_regress_standby_1.log contains:
2024-10-14 10:30:28.483 UTC [4320:12] 027_stream_regress.pl LOG: disconnection: session time: 0:00:03.793 user=pgrunner
database=postgres host=127.0.0.1 port=61748
2024-10-14 10:30:31.442 UTC [8468:2] LOG:  replication terminated by primary server
2024-10-14 10:30:31.442 UTC [8468:3] DETAIL:  End of WAL reached on timeline 1 at 0/158C8B98.
2024-10-14 10:30:31.442 UTC [8468:4] FATAL:  could not send end-of-streaming message to primary: server closed the
connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
    no COPY in progress
2024-10-14 10:30:31.443 UTC [5452:7] LOG:  invalid resource manager ID 101 at 0/158C8B98
2024-10-14 10:35:06.986 UTC [8648:21] LOG:  restartpoint starting: time
2024-10-14 10:35:06.991 UTC [8648:22] LOG:  restartpoint complete: wrote 0 buffers (0.0%), wrote 1 SLRU buffers; 0 WAL
file(s) added, 0 removed, 1 recycled; write=0.001 s, sync=0.001 s, total=0.005 s; sync files=0, longest=0.000 s,
average=0.000 s; distance=15336 kB, estimate=69375 kB; lsn=0/158C8B20, redo lsn=0/158C8B20
2024-10-14 10:35:06.991 UTC [8648:23] LOG:  recovery restart point at 0/158C8B20
2024-10-14 10:35:06.991 UTC [8648:24] DETAIL:  Last completed transaction was at log time 2024-10-14 10:30:24.820804+00.
2024-10-14 10:41:32.510 UTC [4220:4] LOG:  received immediate shutdown request

(That is, the primary was restarted, but the standby never reconnected to
it; it was stuck waiting for something.)

$node_primary->restart was added to 027_stream_regress.pl on 2024-06-27
(with 0844b3968), which explains why this is the first noticed failure
of this kind.

I managed to reproduce this failure locally, then reduced the 027 test
to just restarting the primary in a loop, and with debug logging added
(please find attached) I could see what makes the standby hang:
2024-10-22 08:42:20.016 UTC [6652] DETAIL:  End of WAL reached on timeline 1 at 0/5D14C98.
2024-10-22 08:42:20.016 UTC [6652] FATAL:  could not send end-of-streaming message to primary: server closed the
connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
    no COPY in progress
2024-10-22 08:42:20.016 UTC [21268] LOG:  invalid record length at 0/5D14C98: expected at least 24, got 0
2024-10-22 08:42:20.088 UTC [8256] LOG:  !!!WalReceiverMain| before walrcv_connect
!!!libpqrcv_connect| before loop
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 64)
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 1
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 64)
    !!!WaitEventSetWaitBlock| before WaitForMultipleObjects
    !!!WaitEventSetWaitBlock| WaitForMultipleObjects returned: 3
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 64
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 4)
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 4
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 2)
    !!!WaitEventSetWaitBlock| WL_SOCKET_READABLE, cur_event->fd: 844, WSAGetLastError(): 10054
    !!!WaitEventSetWaitBlock| before WaitForMultipleObjects
2024-10-22 08:44:49.399 UTC [15352] LOG:  restartpoint starting: time
...

For comparison, a previous successful reconnect looks like this:
2024-10-22 08:42:17.275 UTC [6652] LOG:  !!!WalReceiverMain| before walrcv_connect
!!!libpqrcv_connect| before loop
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 64)
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 1
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 64)
    !!!WaitEventSetWaitBlock| before WaitForMultipleObjects
    !!!WaitEventSetWaitBlock| WaitForMultipleObjects returned: 3
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 64
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 4)
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 4
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 2)
    !!!WaitEventSetWaitBlock| WL_SOCKET_READABLE, cur_event->fd: 844, WSAGetLastError(): 10035
    !!!WaitEventSetWaitBlock| before WaitForMultipleObjects
    !!!WaitEventSetWaitBlock| WaitForMultipleObjects returned: 3
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 2
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 4)
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 4
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 2)
    !!!WaitEventSetWaitBlock| WL_SOCKET_READABLE, cur_event->fd: 844, WSAGetLastError(): 10035
    !!!WaitEventSetWaitBlock| before WaitForMultipleObjects
    !!!WaitEventSetWaitBlock| WaitForMultipleObjects returned: 3
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 2
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 2)
    !!!WaitEventSetWaitBlock| WL_SOCKET_READABLE, cur_event->fd: 844, WSAGetLastError(): 10035
    !!!WaitEventSetWaitBlock| before WaitForMultipleObjects
    !!!WaitEventSetWaitBlock| WaitForMultipleObjects returned: 3
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 2
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 2)
    !!!WaitEventSetWaitBlock| WL_SOCKET_READABLE, cur_event->fd: 844, WSAGetLastError(): 10035
    !!!WaitEventSetWaitBlock| before WaitForMultipleObjects
    !!!WaitEventSetWaitBlock| WaitForMultipleObjects returned: 3
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 2
!!!libpqrcv_connect| after loop

(The full log is attached too.)

In other words, once WSARecv() has failed with WSAECONNRESET, a
subsequent WaitForMultipleObjects() call never returns, and the
walreceiver hangs.
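To illustrate the shape of the problem (this is a toy model, not the
Windows API, and the class and method names are mine): on Windows the
close/reset condition behaves like a one-shot notification, so if the
error is consumed by the recv path before the process re-enters the
wait, the event object is never signalled again and the wait blocks
forever:

```python
import threading

class OneShotCloseEvent:
    """Toy model of a one-shot close notification on a socket:
    once the reset has been consumed by a recv, a later wait on
    the associated event object never wakes up."""

    def __init__(self):
        self._event = threading.Event()

    def peer_reset(self):
        # The peer resets the connection: the event is signalled once.
        self._event.set()

    def recv(self):
        # A recv that consumes the reset: it reports the error, and
        # nothing re-arms the event afterwards.
        if self._event.is_set():
            self._event.clear()
            return "WSAECONNRESET"
        return "WSAEWOULDBLOCK"

    def wait(self, timeout):
        # Stand-in for WaitForMultipleObjects(): blocks until the
        # event is signalled; returns False on timeout.
        return self._event.wait(timeout)

sock = OneShotCloseEvent()
sock.peer_reset()
print(sock.recv())              # the reset is consumed here...
print(sock.wait(timeout=0.1))   # ...so this wait never succeeds
```

The real code, of course, involves WSAEventSelect()/WSARecv(), but the
sequence in the debug log above matches this pattern: the reset is
observed (WSAGetLastError(): 10054), and the following
WaitForMultipleObjects() never returns.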

I reproduced this on a8458f508~1 and on 97d891010, so the issue itself is
not new...

[1] https://www.postgresql.org/message-id/CA%2BhUKGKn%2BxWjaEnQ%2BiYYs7STG7_nNP5w5VMb1vSvQfAkkOxOUA%40mail.gmail.com
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-10-14%2010%3A08%3A17

Best regards,
Alexander

Attachment Content-Type Size
099_primary_restart.pl application/x-perl 1.4 KB
099_primary_restart_standby_1.log text/x-log 190.3 KB
walreceiver-win-socket-debugging.patch text/x-patch 3.9 KB
