From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Subject: WaitEventSetWaitBlock() can still hang on Windows due to connection reset
Date: 2024-10-22 10:00:00
Message-ID: 93515d62-edce-9041-ec6e-7122f6e92bea@gmail.com
Lists: pgsql-hackers
Hello hackers,
[ a follow-up to [1] separated to focus on this issue ]
Let me bring your attention to a failure [2], produced by buildfarm animal
drongo last week, with the following diagnostics:
[10:30:30.954](2.110s) ok 5 - compare primary and standby dumps
### Restarting node "primary"
# Running: pg_ctl -w -D
C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/recovery/027_stream_regress\\data/t_027_stream_regress_primary_data/pgdata
-l C:\\prog\\bf\\root\\HEAD\\pgsql.build/testrun/recovery/027_stream_regress\\log/027_stream_regress_primary.log restart
waiting for server to shut down.... done
server stopped
waiting for server to start.... done
server started
# Postmaster PID for node "primary" is 8908
Waiting for replication conn standby_1's replay_lsn to pass 0/158C8B98 on primary
[10:41:32.115](661.161s) # poll_query_until timed out executing this query:
# SELECT '0/158C8B98' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('standby_1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
#
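For context, the harness's poll_query_until simply re-runs the query until its output matches the expected value or a timeout expires. A minimal stand-alone Python model of that retry loop (not the actual Perl implementation from PostgreSQL::Test::Cluster; the names and default timeouts here are illustrative):

```python
import time

def poll_query_until(run_query, expected, timeout=5.0, interval=0.1):
    """Toy model of the TAP harness's poll_query_until: re-run a query
    callback until its output matches `expected` or the timeout expires.
    Returns True on a match, False on timeout."""
    deadline = time.monotonic() + timeout
    last = None
    while time.monotonic() < deadline:
        last = run_query()
        if last == expected:
            return True
        time.sleep(interval)
    print("poll_query_until timed out; last actual query output: %r" % (last,))
    return False

# A standby that eventually reports 'streaming' makes the poll succeed...
attempts = iter(['', '', 't'])
assert poll_query_until(lambda: next(attempts), 't') is True
# ...while one that never does exhausts the timeout, which is exactly
# the failure mode in the 661-second wait shown above.
assert poll_query_until(lambda: '', 't', timeout=0.3) is False
```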
027_stream_regress_standby_1.log contains:
2024-10-14 10:30:28.483 UTC [4320:12] 027_stream_regress.pl LOG: disconnection: session time: 0:00:03.793 user=pgrunner
database=postgres host=127.0.0.1 port=61748
2024-10-14 10:30:31.442 UTC [8468:2] LOG: replication terminated by primary server
2024-10-14 10:30:31.442 UTC [8468:3] DETAIL: End of WAL reached on timeline 1 at 0/158C8B98.
2024-10-14 10:30:31.442 UTC [8468:4] FATAL: could not send end-of-streaming message to primary: server closed the
connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
no COPY in progress
2024-10-14 10:30:31.443 UTC [5452:7] LOG: invalid resource manager ID 101 at 0/158C8B98
2024-10-14 10:35:06.986 UTC [8648:21] LOG: restartpoint starting: time
2024-10-14 10:35:06.991 UTC [8648:22] LOG: restartpoint complete: wrote 0 buffers (0.0%), wrote 1 SLRU buffers; 0 WAL
file(s) added, 0 removed, 1 recycled; write=0.001 s, sync=0.001 s, total=0.005 s; sync files=0, longest=0.000 s,
average=0.000 s; distance=15336 kB, estimate=69375 kB; lsn=0/158C8B20, redo lsn=0/158C8B20
2024-10-14 10:35:06.991 UTC [8648:23] LOG: recovery restart point at 0/158C8B20
2024-10-14 10:35:06.991 UTC [8648:24] DETAIL: Last completed transaction was at log time 2024-10-14 10:30:24.820804+00.
2024-10-14 10:41:32.510 UTC [4220:4] LOG: received immediate shutdown request
(That is, the primary was restarted, but the standby didn't reconnect to it,
apparently waiting for something...)
$node_primary->restart was added to 027_stream_regress.pl on 2024-06-27
(with 0844b3968), which may explain why this is the first noticed failure
of this kind.
I managed to reproduce this failure locally, then reduced the 027 test
to just restarting the primary in a loop; with debug logging added (please
find the patch attached), I could see what makes the standby hang:
2024-10-22 08:42:20.016 UTC [6652] DETAIL: End of WAL reached on timeline 1 at 0/5D14C98.
2024-10-22 08:42:20.016 UTC [6652] FATAL: could not send end-of-streaming message to primary: server closed the
connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
no COPY in progress
2024-10-22 08:42:20.016 UTC [21268] LOG: invalid record length at 0/5D14C98: expected at least 24, got 0
2024-10-22 08:42:20.088 UTC [8256] LOG: !!!WalReceiverMain| before walrcv_connect
!!!libpqrcv_connect| before loop
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 64)
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 1
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 64)
!!!WaitEventSetWaitBlock| before WaitForMultipleObjects
!!!WaitEventSetWaitBlock| WaitForMultipleObjects returned: 3
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 64
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 4)
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 4
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 2)
!!!WaitEventSetWaitBlock| WL_SOCKET_READABLE, cur_event->fd: 844, WSAGetLastError(): 10054
!!!WaitEventSetWaitBlock| before WaitForMultipleObjects
2024-10-22 08:44:49.399 UTC [15352] LOG: restartpoint starting: time
...
For comparison, a previous successful reconnect looks like this:
2024-10-22 08:42:17.275 UTC [6652] LOG: !!!WalReceiverMain| before walrcv_connect
!!!libpqrcv_connect| before loop
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 64)
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 1
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 64)
!!!WaitEventSetWaitBlock| before WaitForMultipleObjects
!!!WaitEventSetWaitBlock| WaitForMultipleObjects returned: 3
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 64
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 4)
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 4
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 2)
!!!WaitEventSetWaitBlock| WL_SOCKET_READABLE, cur_event->fd: 844, WSAGetLastError(): 10035
!!!WaitEventSetWaitBlock| before WaitForMultipleObjects
!!!WaitEventSetWaitBlock| WaitForMultipleObjects returned: 3
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 2
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 4)
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 4
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 2)
!!!WaitEventSetWaitBlock| WL_SOCKET_READABLE, cur_event->fd: 844, WSAGetLastError(): 10035
!!!WaitEventSetWaitBlock| before WaitForMultipleObjects
!!!WaitEventSetWaitBlock| WaitForMultipleObjects returned: 3
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 2
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 2)
!!!WaitEventSetWaitBlock| WL_SOCKET_READABLE, cur_event->fd: 844, WSAGetLastError(): 10035
!!!WaitEventSetWaitBlock| before WaitForMultipleObjects
!!!WaitEventSetWaitBlock| WaitForMultipleObjects returned: 3
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 2
!!!libpqrcv_connect| before WaitLatchOrSocket(io_flag: 2)
!!!WaitEventSetWaitBlock| WL_SOCKET_READABLE, cur_event->fd: 844, WSAGetLastError(): 10035
!!!WaitEventSetWaitBlock| before WaitForMultipleObjects
!!!WaitEventSetWaitBlock| WaitForMultipleObjects returned: 3
!!!libpqrcv_connect| after WaitLatchOrSocket, rc: 2
!!!libpqrcv_connect| after loop
(The full log is attached too.)
In other words, once WSARecv() has failed with WSAECONNRESET, a subsequent
call to WaitForMultipleObjects() still hangs, seemingly because the
connection-reset notification has already been consumed and the socket's
event object is never signaled again.
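This is consistent with the close notification being delivered only once per socket: if something else (here, apparently, the failing WSARecv()) consumes it, a later wait on the event object never wakes. The following toy model (pure Python threading, not the Windows API; the class and its one-shot semantics are illustrative assumptions, not PostgreSQL code) shows the same shape of hang:

```python
import threading

class OneShotCloseSocket:
    """Toy model of a socket whose close notification is delivered
    exactly once, loosely mimicking how a reset can behave with
    event objects on Windows. Purely illustrative."""
    def __init__(self):
        self._event = threading.Event()

    def connection_reset(self):
        # The peer resets the connection: the event is signaled once.
        self._event.set()

    def recv(self):
        # A recv() that fails with a reset consumes the notification and
        # leaves the event unsignaled, as the failing trace above suggests
        # happens after WSARecv() returns WSAECONNRESET (10054).
        if self._event.is_set():
            self._event.clear()
            raise ConnectionResetError("connection reset (10054)")
        raise BlockingIOError("would block (10035)")

    def wait(self, timeout):
        # Stands in for WaitForMultipleObjects(): wakes only if the
        # event is signaled again, which never happens here.
        return self._event.wait(timeout)

sock = OneShotCloseSocket()
sock.connection_reset()
try:
    sock.recv()            # consumes the one-shot close notification
except ConnectionResetError:
    pass
assert sock.wait(timeout=0.2) is False   # the wait "hangs" (times out here)
```

In the failing trace the process similarly never returns from WaitForMultipleObjects(), while in the successful trace the event keeps getting signaled and the connect loop completes.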
I reproduced this on a8458f508~1 and on 97d891010, so the issue itself is
not new...
[1] https://www.postgresql.org/message-id/CA%2BhUKGKn%2BxWjaEnQ%2BiYYs7STG7_nNP5w5VMb1vSvQfAkkOxOUA%40mail.gmail.com
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=drongo&dt=2024-10-14%2010%3A08%3A17
Best regards,
Alexander
Attachments:
099_primary_restart.pl (application/x-perl, 1.4 KB)
099_primary_restart_standby_1.log (text/x-log, 190.3 KB)
walreceiver-win-socket-debugging.patch (text/x-patch, 3.9 KB)