Re: Flaky 003_start_stop.pl test

From: Andres Freund <andres(at)anarazel(dot)de>
To: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>
Cc: Jelte Fennema-Nio <postgres(at)jeltef(dot)nl>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>
Subject: Re: Flaky 003_start_stop.pl test
Date: 2025-03-05 00:41:38
Message-ID: rm3wawhaebopcncwwadnenypl7mwwqacoqj2u35qi36d5zcms2@wbbcb6t6ukf4
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi,

On 2025-03-04 09:48:45 +1300, Thomas Munro wrote:
> On Tue, Mar 4, 2025 at 5:48 AM Andres Freund <andres(at)anarazel(dot)de> wrote:
> > On 2025-02-24 11:26:56 +0100, Jelte Fennema-Nio wrote:
> > > [1]: https://cirrus-ci.com/task/5571017969500160?logs=test_world#L256
> >
> > A possibly related test failure is:
> >
> > https://cirrus-ci.com/task/5005337046024192?logs=check_world#L259
>
> They look different to me:

Thanks for checking!

> (1) In Jelte's report for 003_start_stop, we see:
>
> [03:40:42.347](2.022s) ok 1 - dead-end connection 0
> [03:40:42.348](0.001s) ok 2 - dead-end connection 1
> [03:40:42.349](0.001s) ok 3 - dead-end connection 2
> ... up to 20 ...
>
> We see in the server log that the first one had pid 31660 and slot 1,
> but it never said "SSLRequest rejected" like all the others. The Perl
> code must have received "N", or it would not have moved onto opening
> the next connection. Then just as it is about to try to connect,
> hoping to see "too many clients", that one exits, the "invalid_role"
> connection gets slot 1 instead of the dead-end backend we expected:
>
> 03:40:42.375 UTC [31645][postmaster] DEBUG: client backend (PID
> 31660) exited with exit code 0 <-- goodbye, silent backend
> 03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG:
> connection received: host=[local]
> 03:40:42.480 UTC [31681][not initialized] [[unknown]][:0] LOG:
> SSLRequest rejected
> 03:40:42.511 UTC [31645][postmaster] DEBUG: assigned pm child slot 1
> for backend <-- we get silent backend's slot
> 03:40:42.512 UTC [31645][postmaster] DEBUG: forked new client
> backend, pid=31690 socket=7
> 03:40:42.542 UTC [31690][not initialized] [[unknown]][:0] LOG:
> connection received: host=[local]
> 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG:
> connection authenticated: user="invalid_user" method=trust
> (/tmp/cirrus-ci-build/build/testrun/postmaster/003_start_stop/data/t_003_start_stop_main_data/pgdata/pg_hba.conf:117)
> 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] LOG:
> connection authorized: user=invalid_user database=postgres
> application_name=003_start_stop.pl
> 03:40:42.543 UTC [31690][client backend] [[unknown]][0/1:0] FATAL:
> role "invalid_user" does not exist

Good catch.

The pattern is similar in other failed tests. There has been a spate of errors
on postgres/postgres' CI, which made me look at this again.

> Could the backend have sent an "N" (NoticeResponse) unrelated to SSL
> negotiation?

I don't think that'd be legal protocol flow from the server side. And, given
that the test uses log_min_messages=debug2, should have been logged in the
server log.

All the recent failures were on FreeBSD, interestingly.

I wonder if we should, at least temporarily, increase the test's log level?

I also suspect that it'd be worthwhile to add a few more if
(Trace_connection_negotiation) messages.

Greetings,

Andres Freund

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Jacob Champion 2025-03-05 00:53:14 Re: [PATCH] pg_stat_activity: make slow/hanging authentication more visible
Previous Message Euler Taveira 2025-03-05 00:33:39 Re: log_min_messages per backend type