From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Recent 027_streaming_regress.pl hangs
Date: 2024-03-21 00:41:45
Message-ID: 20240321004145.obrbakt4ugzcsphy@awork3.anarazel.de
Lists: pgsql-hackers
Hi,
On 2024-03-14 16:56:39 -0400, Tom Lane wrote:
> Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> > On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
> >> Could it be that the timeout (360 sec?) is just not enough for the test
> >> under the current (changed due to switch to meson) conditions?
>
> > But you're right that under meson the test takes a lot longer, I guess
> > due to increased concurrency:
>
> What it seems to be is highly variable. Looking at calliphoridae's
> last half dozen successful runs, I see reported times for
> 027_stream_regress anywhere from 183 to 704 seconds. I wonder what
> else is happening on that machine.
There are a lot of other animals on the same machine, but it's rarely fully
loaded (in either CPU or IO).
I don't think the test just being slow is the issue here. E.g., in the last
failing iteration
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-20%2022%3A03%3A15
the tests completed:
2024-03-20 22:07:50.239 UTC [3937667][client backend][22/3255:0] LOG: statement: DROP ROLE regress_tablespace_user2;
2024-03-20 22:07:50.251 UTC [3937667][client backend][:0] LOG: disconnection: session time: 0:00:12.431 user=bf database=regression host=[local]
and we then waited quite a while for replication to catch up:
2024-03-20 22:14:01.904 UTC [56343][client backend][6/1925:0] LOG: connection authorized: user=bf database=postgres application_name=027_stream_regress.pl
2024-03-20 22:14:01.930 UTC [56343][client backend][6/1926:0] LOG: statement: SELECT '0/15BA21B0' <= replay_lsn AND state = 'streaming'
FROM pg_catalog.pg_stat_replication
WHERE application_name IN ('standby_1', 'walreceiver')
2024-03-20 22:14:01.958 UTC [56343][client backend][:0] LOG: disconnection: session time: 0:00:00.063 user=bf database=postgres host=[local]
2024-03-20 22:14:02.083 UTC [3729516][postmaster][:0] LOG: received immediate shutdown request
2024-03-20 22:14:04.970 UTC [3729516][postmaster][:0] LOG: database system is shut down
There was no activity for 7 minutes.
System statistics show relatively low CPU and IO load for the period from
22:00 to 22:10.
I suspect we have some more fundamental instability on our hands; there have
been failures like this going back a while, and on various machines.
I think at the very least we should make Cluster.pm's wait_for_catchup() print
some information when it times out - right now it's nigh on undebuggable,
because we don't even log what we were waiting for or what the actual
replication position was.
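Roughly along these lines - just a sketch built on the existing
poll_query_until()/safe_psql() helpers, with illustrative variable names that
don't necessarily match what wait_for_catchup() actually uses:

    # Sketch only - imagined as sitting inside wait_for_catchup();
    # $standby_name, $mode and $target_lsn are placeholders for whatever
    # the routine really works with.
    my $query =
      qq[SELECT '$target_lsn' <= ${mode}_lsn AND state = 'streaming'
         FROM pg_catalog.pg_stat_replication
         WHERE application_name IN ('$standby_name', 'walreceiver')];

    unless ($self->poll_query_until('postgres', $query))
    {
        # Before giving up, log what we were waiting for and where the
        # walsenders actually are, so a timeout is debuggable from the logs.
        my $details = $self->safe_psql('postgres',
            qq[SELECT application_name, state, sent_lsn, write_lsn,
                      flush_lsn, replay_lsn
               FROM pg_catalog.pg_stat_replication]);
        diag("timed out waiting for catchup on " . $self->name
             . ": wanted ${mode}_lsn >= $target_lsn\n"
             . "pg_stat_replication:\n" . $details);
        croak "timed out waiting for catchup";
    }

That way a buildfarm failure would at least tell us whether the standby was
lagging, stuck, or disconnected at the time of the timeout.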
> Also, this is probably not
> helping anything:
>
> 'extra_config' => {
> ...
> 'fsync = on'
At some point we had practically no test coverage of fsync, so I made my
animals use fsync. I think we still have little coverage there. I could
probably reduce the number of animals using it, though.
Greetings,
Andres Freund