From: | Alexander Lakhin <exclusion(at)gmail(dot)com> |
---|---|
To: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
Cc: | Noah Misch <noah(at)leadboat(dot)com>, Andrew Dunstan <andrew(at)dunslane(dot)net>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org> |
Subject: | Re: Why is src/test/modules/committs/t/002_standby.pl flaky? |
Date: | 2023-11-21 11:00:00 |
Message-ID: | 32d112ee-0b6f-d4ab-441b-e2bba66a1d83@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
Hello Thomas,
10.11.2023 06:31, Thomas Munro wrote:
> Here is a new attempt to fix this mess. Disclaimer: this based
> entirely on reading the manual and vicariously hacking a computer I
> don't have via CI.
As it also might (and I would like it to) be the final attempt, I decided
to gather information and all the cases that we had on this topic.
At least, for the last 5 years we've seen:
[1] 2019-01-18: Re: BUG #15598: PostgreSQL Error Code is not reported when connection terminated due to
idle-in-transaction timeout
test 099_case_15598 made (in attachment)
no commit
[2] 2019-01-22: Rare SSL failures on eelpout
references [1]
test 099_rare_ssl_failures made (in attachment)
commit 2019-03-19 1f39a1c06: Restructure libpq's hqandling of send failures.
[3] 2019-11-28: pgsql: Add tests for '-f' option in dropdb utility.
test 051_dropdb_force was proposed (included in the attachment)
commit 2019-11-30 98a9b37ba: Revert commits 290acac92b and 8a7e9e9dad.
[4] 2019-12-06: closesocket behavior in different platforms
references [1], [2], [3]; a documentation change proposed
no commit
[5] 2020-06-03: libpq copy error handling busted
test 099_pgbench_with_server_off made (in attachment)
commit 2020-06-07 7247e243a: Try to read data from the socket in pqSendSome's write_failed paths. (a fix for 1f39a1c06)
[6] 2020-10-19: BUG #16678: The ecpg connect/test5 test sometimes fails on Windows
no commit
[7] 2021-11-17: Windows: Wrong error message at connection termination
references [6]
commit: 2021-12-02 6051857fc: On Windows, close the client socket explicitly during backend shutdown.
[8] 2021-12-05 17:03:18: MSVC SSL test failure
commit: 2021-12-07 ed52c3707: On Windows, also call shutdown() while closing the client socket.
[9] 2021-12-30: Why is src/test/modules/committs/t/002_standby.pl flaky?
additional test 099_postgres_fdw_disconnect made (in attachment)
commit 2022-01-26 75674c7ec: Revert "graceful shutdown" changes for Windows, in back branches only. (REL_14_STABLE)
commit 2022-03-22 29992a6a5: Revert "graceful shutdown" changes for Windows. (master)
[10] 2022-02-02 19:19:22: BUG #17391: While using --with-ssl=openssl and PG_TEST_EXTRA='ssl' options, SSL tests fail on
OpenBSD 7.0
commit 2022-02-12 335fa5a26: Fix thinko in PQisBusy(). (a fix for 1f39a1c06)
commit 2022-02-12 faa189c93: Move libpq's write_failed mechanism down to pqsecure_raw_write(). (a fix for 1f39a1c06)
As it becomes difficult to test/check all those cases scattered around, I
decided to retell the whole story by means of tests. Please look at the
script win-sock-tests-01.cmd attached, which can be executed both on
Windows (as regular cmd) and on Linux (bash win-sock-tests-01.cmd).
At the end of the script we can see several things.
First, the last patchset you posted, applied to b282fa88d~1, fixes the
issue discussed in this thread (it eliminates failures of
commit_ts_002_standby (and also 099_postgres_fdw_disconnect)).
Second, with the sleep added (see [6]), I had the same results of
`meson test` on Windows and on Linux.
Namely, there are some tests failing (see win-sock-tests-01.cmd) due to
walsender preventing server stop.
I describe this issue separately (see details in walsender-cannot-exit.txt;
maybe it's worth to discuss it in a separate thread) as it's kind of
off-topic. With the supplementary sleep() added to WalSndLoop(), the
complete `meson test` passes successfully both on Windows and on Linux.
Third, cases [1] and [3] are still broken, due to a Windows peculiarity.
Please see server.c and client.c attached, which demonstrate:
Case "no shutdown/closesocket" on Windows:
C:\src>server
Listening for incoming connections...
C:\src>client
Client connected: 127.0.0.1:64395
Connection to server established. Enter message: msg
Client message: msg
Sending message...
Sleeping...
Exiting...
C:\src>
Calling recv()...
recv() failed
Case "no shutdown/closesocket" on Linux:
$ server
Listening for incoming connections...
$ client
Client connected: 127.0.0.1:33044
Connection to server established. Enter message: msg
Client message: msg
Sending message...
Sleeping...
Exiting...
$
Calling recv()...
Server message: MESSAGE
Case "shutdown/closesocket" on Windows:
C:\src>server shutdown closesocket
Listening for incoming connections...
C:\src>client
Client connected: 127.0.0.1:64396
Connection to server established. Enter message: msg
Client message: msg
Sending message...
Sleeping...
Calling shutdown()...
Calling closesocket()...
Exiting...
C:\src>
Calling recv()...
Server message: MESSAGE
That's okay so far, but what makes cases [1]/[3] different from all cases
in the whole existing test suite, which now performed successfully, is
that psql calls send() before recv() on a socket closed and abandoned by
the server.
Those programs show on Windows:
C:\src>server shutdown closesocket
Listening for incoming connections...
C:\src>client send_before_recv
Client connected: 127.0.0.1:64397
Connection to server established. Enter message: msg
Client message: msg
Sending message...
Sleeping...
Calling shutdown()...
Calling closesocket()...
Exiting...
C:\src>
send() returned 4
Calling recv()...
recv() failed
As known, on Linux the same scenario works just fine.
Fourth, tests 099_rare_ssl_failures (and 001_ssl_tests, though more rarely)
fail for me with the latest patches (only on Windows again):
...
8/10 postgresql:ssl_1 / ssl_1/099_rare_ssl_failures ERROR 141.34s exit status 3
...
9/10 postgresql:ssl_7 / ssl_7/099_rare_ssl_failures OK 142.52s 2000 subtests passed
10/10 postgresql:ssl_6 / ssl_6/099_rare_ssl_failures OK 143.00s 2000 subtests passed
Ok: 2
Expected Fail: 0
Fail: 8
ssl_1\099_rare_ssl_failures\log\regress_log_099_rare_ssl_failures.txt:
...
iteration 354
[20:57:06.984](0.106s) ok 707 - certificate authorization fails with revoked client cert with server-side CRL directory
[20:57:06.984](0.000s) ok 708 - certificate authorization fails with revoked client cert with server-side CRL directory:
matches
iteration 355
[20:57:07.156](0.172s) ok 709 - certificate authorization fails with revoked client cert with server-side CRL directory
[20:57:07.156](0.001s) not ok 710 - certificate authorization fails with revoked client cert with server-side CRL
directory: matches
[20:57:07.159](0.003s) # Failed test 'certificate authorization fails with revoked client cert with server-side CRL
directory: matches'
# at .../src/test/ssl_1/t/099_rare_ssl_failures.pl line 88.
[20:57:07.159](0.000s) # 'psql: error: connection to server at "127.0.0.1", port 59843 failed: could
not receive data from server: Software caused connection abort (0x00002745/10053)
# SSL SYSCALL error: Software caused connection abort (0x00002745/10053)'
# doesn't match '(?^:SSL error: sslv3 alert certificate revoked)'
...
It seems to me that it can have the same explanation (if openssl can call
send() before recv() under the hood), but maybe it should be investigated
further.
> Review/poking-with-a-stick/trying-to-break-it most welcome.
I could not find anything suspicious in the code, except for maybe a typo
"The are ...".
[1] https://www.postgresql.org/message-id/flat/87k1iy44fd.fsf%40news-spur.riddles.org.uk#ba0c07f13c300d42fd537855dd95dd2b
[2] https://www.postgresql.org/message-id/flat/CAEepm%3D2n6Nv%2B5tFfe8YnkUm1fXgvxR0Mm1FoD%2BQKG-vLNGLyKg%40mail.gmail.com
[3] https://www.postgresql.org/message-id/flat/E1iaD8h-0004us-K9(at)gemulon(dot)postgresql(dot)org
[4]
https://www.postgresql.org/message-id/flat/CALDaNm2tEvr_Kum7SyvFn0%3D6H3P0P-Zkhnd%3DdkkX%2BQ%3DwKutZ%3DA%40mail.gmail.com
[5] https://www.postgresql.org/message-id/flat/20200603201242.ofvm4jztpqytwfye%40alap3.anarazel.de
[6] https://www.postgresql.org/message-id/16678-253e48d34dc0c376@postgresql.org
[7] https://www.postgresql.org/message-id/flat/90b34057-4176-7bb0-0dbb-9822a5f6425b%40greiz-reinsdorf.de
[8] https://www.postgresql.org/message-id/flat/af5e0bf3-6a61-bb97-6cba-061ddf22ff6b%40dunslane.net
[9]
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BOeoETZQ%3DQw5Ub5h3tmwQhBmDA%3DnuNO3KG%3DzWfUypFAw%40mail.gmail.com
[10] https://www.postgresql.org/message-id/flat/17391-304f81bcf724b58b%40postgresql.org
Best regards,
Alexander
Attachment | Content-Type | Size |
---|---|---|
walsender-cannot-exit.txt | text/plain | 2.2 KB |
server.c | text/x-csrc | 2.1 KB |
client.c | text/x-csrc | 1.7 KB |
win-sock-tests-01.patch | text/x-patch | 19.5 KB |
win-sock-tests-01.cmd.txt | text/plain | 12.0 KB |
From | Date | Subject | |
---|---|---|---|
Next Message | Drouvot, Bertrand | 2023-11-21 11:05:31 | Re: Synchronizing slots from primary to standby |
Previous Message | Hayato Kuroda (Fujitsu) | 2023-11-21 10:37:03 | RE: Random pg_upgrade test failure on drongo |