Re: Recent 027_streaming_regress.pl hangs

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Subject: Re: Recent 027_streaming_regress.pl hangs
Date: 2024-07-31 15:54:31
Message-ID: f61976a9-b045-42c0-9a5a-c142c2142641@dunslane.net
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers


On 2024-07-25 Th 6:33 PM, Andrew Dunstan wrote:
>
>
> On 2024-07-25 Th 5:14 PM, Tom Lane wrote:
>> I wrote:
>>> I'm confused by crake's buildfarm logs. AFAICS it is not running
>>> recovery-check at all in most of the runs; at least there is no
>>> mention of that step, for example here:
>>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-25%2013%3A27%3A02
>> Oh, I see it: the log file that is called recovery-check in a
>> failing run is called misc-check if successful. That seems
>> mighty bizarre, and it's not how my own animals behave.
>> Something weird about the meson code path, perhaps?
>
>
> Yes, it was discussed some time ago. As suggested by Andres, we run
> the meson test suite more or less all together (in effect like "make
> checkworld" but without the main regression suite, which is run on its
> own first), rather than in the separate (and serialized) way we do
> with the configure/make tests. That results in significant speedup. If
> the tests fail we report the failure as happening at the first failure
> we encounter, which is possibly less than ideal, but I haven't got a
> better idea.
>
>
>> Anyway, in this successful run:
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2024-07-25%2018%3A57%3A02&stg=misc-check
>>
>> here are some salient test timings:
>>
>> 1/297 postgresql:pg_upgrade / pg_upgrade/001_basic OK 0.18s 9 subtests passed
>> 2/297 postgresql:pg_upgrade / pg_upgrade/003_logical_slots OK 15.95s 12 subtests passed
>> 3/297 postgresql:pg_upgrade / pg_upgrade/004_subscription OK 16.29s 14 subtests passed
>> 17/297 postgresql:isolation / isolation/isolation OK 71.60s 119 subtests passed
>> 41/297 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade OK 169.13s 18 subtests passed
>> 140/297 postgresql:initdb / initdb/001_initdb OK 41.34s 52 subtests passed
>> 170/297 postgresql:recovery / recovery/027_stream_regress OK 469.49s 9 subtests passed
>>
>> while in the next, failing run
>>
>> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2024-07-25%2020%3A18%3A05&stg=recovery-check
>>
>> the same tests took:
>>
>> 1/297 postgresql:pg_upgrade / pg_upgrade/001_basic OK 0.22s 9 subtests passed
>> 2/297 postgresql:pg_upgrade / pg_upgrade/003_logical_slots OK 56.62s 12 subtests passed
>> 3/297 postgresql:pg_upgrade / pg_upgrade/004_subscription OK 71.92s 14 subtests passed
>> 21/297 postgresql:isolation / isolation/isolation OK 299.12s 119 subtests passed
>> 31/297 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade OK 344.42s 18 subtests passed
>> 159/297 postgresql:initdb / initdb/001_initdb OK 344.46s 52 subtests passed
>> 162/297 postgresql:recovery / recovery/027_stream_regress ERROR 840.84s exit status 29
>>
>> Based on this, it seems fairly likely that crake is simply timing out
>> as a consequence of intermittent heavy background activity.
>>
>
>
> The latest failure is this:
>
>
> Waiting for replication conn standby_1's replay_lsn to pass 2/88E4E260 on primary
> [16:40:29.481](208.545s) # poll_query_until timed out executing this query:
> # SELECT '2/88E4E260' <= replay_lsn AND state = 'streaming'
> # FROM pg_catalog.pg_stat_replication
> # WHERE application_name IN ('standby_1', 'walreceiver')
> # expecting this output:
> # t
> # last actual query output:
> # f
> # with stderr:
> timed out waiting for catchup at /home/andrew/bf/root/HEAD/pgsql/src/test/recovery/t/027_stream_regress.pl line 103.
>
>
> Maybe it's a case where the system is overloaded, I dunno. I wouldn't bet my house on it. Pretty much nothing else runs on this machine.
>
> I have added a mild throttle to the buildfarm config so it doesn't try
> to run every branch at once. Maybe I also need to bring down the
> number or meson jobs too? But I suspect there's something deeper.
> Prior to the failure of this test 10 days ago it hadn't failed in a
> very long time. The OS was upgraded a month ago. Eight or so days ago
> I changed PG_TEST_EXTRA. I can't think of anything else.
>
>
>

There seem to be a bunch of recent failures, and not just on crake, and
not just on HEAD:
<https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=14&member=&stage=recoveryCheck&filter=Submit>

cheers

andrew

--
Andrew Dunstan
EDB:https://www.enterprisedb.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Tom Lane 2024-07-31 16:05:34 Re: Recent 027_streaming_regress.pl hangs
Previous Message Jim Vanns 2024-07-31 15:41:51 Re: Suggestions to overcome 'multixact "members" limit exceeded' in temporary tables