From: | Thomas Munro <thomas(dot)munro(at)gmail(dot)com> |
---|---|
To: | Andres Freund <andres(at)anarazel(dot)de> |
Cc: | Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi> |
Subject: | Re: Flaky 003_start_stop.pl test |
Date: | 2025-03-03 20:48:45 |
Message-ID: | CA+hUKGKJSOAdAukP4QTkR3-jFws39+8C197XC-a97dgYr=cdBA@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
On Tue, Mar 4, 2025 at 5:48 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> On 2025-02-24 11:26:56 +0100, Jelte Fennema-Nio wrote:
> > [1]: https://cirrus-ci.com/task/5571017969500160?logs=test_world#L256
>
> A possibly related test failure is:
>
> https://cirrus-ci.com/task/5005337046024192?logs=check_world#L259
They look different to me:
(1) In Jelte's report for 003_start_stop, we see:
[03:40:42.347](2.022s) ok 1 - dead-end connection 0
[03:40:42.348](0.001s) ok 2 - dead-end connection 1
[03:40:42.349](0.001s) ok 3 - dead-end connection 2
... up to 20 ...
We see in the server log that the first one had pid 31660 and slot 1,
but it never said "SSLRequest rejected" like all the others. The Perl
code must have received "N", or it would not have moved onto opening
the next connection. Then just as it is about to try to connect,
hoping to see "too many clients", that one exits, the "invalid_role"
connection gets slot 1 instead of the dead-end backend we expected:
03:40:42.375 UTC [31645][postmaster] DEBUG: client backend (PID
31660) exited with exit code 0 <-- goodbye, silent backend
03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG:
connection received: host=[local]
03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG:
SSLRequest rejected
03:40:42.511 UTC [31645][postmaster] DEBUG: assigned pm child slot 1
for backend <-- we get silent backend's slot
03:40:42.512 UTC [31645][postmaster] DEBUG: forked new client
backend, pid=31690 socket=7
03:40:42.542 UTC [31690][not initialized] [[unknown]][:0] LOG:
connection received: host=[local]
03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG:
connection authenticated: user="invalid_user" method=trust
(/tmp/cirrus-ci-build/build/testrun/postmaster/003_start_stop/data/t_003_start_stop_main_data/pgdata/pg_hba.conf:117)
03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG:
connection authorized: user=invalid_user database=postgres
application_name=003_start_stop.pl
03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] FATAL:
role "invalid_user" does not exist
Could the backend have sent an "N" (NoticeResponse) unrelated to SSL
negotiation?
(2) In Andres's report for 002_connection_limits, we see that the
server logged "FATAL: remaining connection slots are reserved for
roles with the SUPERUSER attribute" but the test saw "server closed
the connection unexpectedly", on Windows. That's the socket/exit
phenomenon of the simple type where the clients next call is
recv()[1][2]. If there are no objections I think I will go ahead and
re-commit 6051857fc, as threatened in a8458f50, to master only. And
think about how to collect CI/BF stats on this specific failure
pattern (which happens in various tests with various FATAL messages)
to measure the effects.
[1] https://wiki.postgresql.org/wiki/Known_Buildfarm_Test_Failures#Miscellaneous_tests_fail_on_Windows_due_to_a_connection_closed_before_receiving_a_final_error_message
[2] https://www.postgresql.org/message-id/flat/CA%2BhUKGLR10ZqRCvdoRrkQusq75wF5%3DvEetRSs2_u1s%2BFAUosFQ%40mail.gmail.com
From | Date | Subject | |
---|---|---|---|
Next Message | Andres Freund | 2025-03-03 20:50:09 | Re: Ubsan complaint on kestrel |
Previous Message | Tomas Vondra | 2025-03-03 20:31:42 | Re: scalability bottlenecks with (many) partitions (and more) |