Re: Recent 027_streaming_regress.pl hangs

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: Re: Recent 027_streaming_regress.pl hangs
Date: 2024-03-21 00:41:45
Message-ID: 20240321004145.obrbakt4ugzcsphy@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2024-03-14 16:56:39 -0400, Tom Lane wrote:
> Thomas Munro <thomas(dot)munro(at)gmail(dot)com> writes:
> > On Fri, Mar 15, 2024 at 7:00 AM Alexander Lakhin <exclusion(at)gmail(dot)com> wrote:
> >> Could it be that the timeout (360 sec?) is just not enough for the test
> >> under the current (changed due to switch to meson) conditions?
>
> > But you're right that under meson the test takes a lot longer, I guess
> > due to increased concurrency:
>
> What it seems to be is highly variable. Looking at calliphoridae's
> last half dozen successful runs, I see reported times for
> 027_stream_regress anywhere from 183 to 704 seconds. I wonder what
> else is happening on that machine.

There are a lot of other animals on the same machine, but it's rarely fully
loaded (in either CPU or IO).

I don't think the test simply being slow is the issue here; e.g. in the last
failing iteration
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=calliphoridae&dt=2024-03-20%2022%3A03%3A15

the tests completed:

2024-03-20 22:07:50.239 UTC [3937667][client backend][22/3255:0] LOG: statement: DROP ROLE regress_tablespace_user2;
2024-03-20 22:07:50.251 UTC [3937667][client backend][:0] LOG: disconnection: session time: 0:00:12.431 user=bf database=regression host=[local]

and we then waited for replication to catch up for quite a while:

2024-03-20 22:14:01.904 UTC [56343][client backend][6/1925:0] LOG: connection authorized: user=bf database=postgres application_name=027_stream_regress.pl
2024-03-20 22:14:01.930 UTC [56343][client backend][6/1926:0] LOG: statement: SELECT '0/15BA21B0' <= replay_lsn AND state = 'streaming'
FROM pg_catalog.pg_stat_replication
WHERE application_name IN ('standby_1', 'walreceiver')
2024-03-20 22:14:01.958 UTC [56343][client backend][:0] LOG: disconnection: session time: 0:00:00.063 user=bf database=postgres host=[local]
2024-03-20 22:14:02.083 UTC [3729516][postmaster][:0] LOG: received immediate shutdown request
2024-03-20 22:14:04.970 UTC [3729516][postmaster][:0] LOG: database system is shut down

There was no activity for 7 minutes.

System statistics show relatively low CPU and IO load for the period from
22:00 to 22:10.

I suspect we have some more fundamental instability on our hands; there have
been failures like this going back a while, and on various machines.

I think at the very least we should make Cluster.pm's wait_for_catchup() print
some information when it times out - right now it's nigh on undebuggable,
because we don't even log what we were waiting for or what the actual
replication position was.
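
Something along these lines would already help (just a rough sketch against
PostgreSQL::Test::Cluster, the helper name is made up and this is not what's
in Cluster.pm today): on timeout, query pg_stat_replication on the primary
and include the result in the failure message.

    use Carp;
    use PostgreSQL::Test::Cluster;

    # Hypothetical helper, not the real wait_for_catchup(): poll until the
    # standby has replayed $target_lsn, and on timeout report what
    # pg_stat_replication actually showed instead of failing silently.
    sub wait_for_catchup_verbose
    {
        my ($primary, $standby_name, $target_lsn) = @_;

        my $query =
            qq{SELECT '$target_lsn' <= replay_lsn AND state = 'streaming'
               FROM pg_catalog.pg_stat_replication
               WHERE application_name = '$standby_name'};

        return if $primary->poll_query_until('postgres', $query);

        # Timed out: log the target LSN and the actual replication positions.
        my $status = $primary->safe_psql('postgres',
            q{SELECT application_name, state, sent_lsn, write_lsn,
                     flush_lsn, replay_lsn
              FROM pg_catalog.pg_stat_replication});
        croak "timed out waiting for catchup to $target_lsn on $standby_name;"
            . " pg_stat_replication reports:\n$status";
    }

That way a buildfarm failure would at least show how far the standby had
gotten before the timeout.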

> Also, this is probably not
> helping anything:
>
> 'extra_config' => {
> ...
> 'fsync = on'

At some point we had practically no test coverage of fsync, so I made my
animals use fsync. I think we still have little coverage. I probably could
reduce the number of animals using it though.

Greetings,

Andres Freund
