Re: Recent 027_streaming_regress.pl hangs

From: Andrew Dunstan <andrew(at)dunslane(dot)net>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Alexander Lakhin <exclusion(at)gmail(dot)com>, Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andres Freund <andres(at)anarazel(dot)de>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Subject: Re: Recent 027_streaming_regress.pl hangs
Date: 2024-07-25 22:33:19
Message-ID: 3d35a43c-2439-4840-9066-0a661efcf1f7@dunslane.net
Lists: pgsql-hackers


On 2024-07-25 Th 5:14 PM, Tom Lane wrote:
> I wrote:
>> I'm confused by crake's buildfarm logs. AFAICS it is not running
>> recovery-check at all in most of the runs; at least there is no
>> mention of that step, for example here:
>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2024-07-25%2013%3A27%3A02
> Oh, I see it: the log file that is called recovery-check in a
> failing run is called misc-check if successful. That seems
> mighty bizarre, and it's not how my own animals behave.
> Something weird about the meson code path, perhaps?

Yes, it was discussed some time ago. As suggested by Andres, we run the
meson test suite more or less all together (in effect like "make
check-world", but without the main regression suite, which is run on its
own first), rather than in the separate (and serialized) way we do with
the configure/make tests. That results in a significant speedup. If the
tests fail we report the failure as happening at the first failure we
encounter, which is possibly less than ideal, but I haven't got a better
idea.
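
For anyone not familiar with that setup, the effect is roughly the
following (a sketch of the idea only, assuming a recent meson tree; the
buildfarm client's actual invocation differs):

    # Run the core regression suite first, on its own ("setup" builds
    # the temporary installation the other suites need):
    meson test -C build --suite setup --suite regress

    # Then run everything else together, letting meson schedule the
    # remaining suites in parallel:
    meson test -C build --no-suite regress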

>
> Anyway, in this successful run:
>
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2024-07-25%2018%3A57%3A02&stg=misc-check
>
> here are some salient test timings:
>
>   1/297 postgresql:pg_upgrade / pg_upgrade/001_basic          OK      0.18s  9 subtests passed
>   2/297 postgresql:pg_upgrade / pg_upgrade/003_logical_slots  OK     15.95s  12 subtests passed
>   3/297 postgresql:pg_upgrade / pg_upgrade/004_subscription   OK     16.29s  14 subtests passed
>  17/297 postgresql:isolation / isolation/isolation            OK     71.60s  119 subtests passed
>  41/297 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade     OK    169.13s  18 subtests passed
> 140/297 postgresql:initdb / initdb/001_initdb                 OK     41.34s  52 subtests passed
> 170/297 postgresql:recovery / recovery/027_stream_regress     OK    469.49s  9 subtests passed
>
> while in the next, failing run
>
> https://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=crake&dt=2024-07-25%2020%3A18%3A05&stg=recovery-check
>
> the same tests took:
>
>   1/297 postgresql:pg_upgrade / pg_upgrade/001_basic          OK      0.22s  9 subtests passed
>   2/297 postgresql:pg_upgrade / pg_upgrade/003_logical_slots  OK     56.62s  12 subtests passed
>   3/297 postgresql:pg_upgrade / pg_upgrade/004_subscription   OK     71.92s  14 subtests passed
>  21/297 postgresql:isolation / isolation/isolation            OK    299.12s  119 subtests passed
>  31/297 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade     OK    344.42s  18 subtests passed
> 159/297 postgresql:initdb / initdb/001_initdb                 OK    344.46s  52 subtests passed
> 162/297 postgresql:recovery / recovery/027_stream_regress     ERROR 840.84s  exit status 29
>
> Based on this, it seems fairly likely that crake is simply timing out
> as a consequence of intermittent heavy background activity.
>

The latest failure is this:

Waiting for replication conn standby_1's replay_lsn to pass 2/88E4E260 on primary
[16:40:29.481](208.545s) # poll_query_until timed out executing this query:
# SELECT '2/88E4E260' <= replay_lsn AND state = 'streaming'
# FROM pg_catalog.pg_stat_replication
# WHERE application_name IN ('standby_1', 'walreceiver')
# expecting this output:
# t
# last actual query output:
# f
# with stderr:
timed out waiting for catchup at /home/andrew/bf/root/HEAD/pgsql/src/test/recovery/t/027_stream_regress.pl line 103.
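
For the archives: that query is generated by wait_for_catchup() /
poll_query_until() in PostgreSQL::Test::Cluster, which retries it until
the standby's replay_lsn passes the target or the timeout derived from
PG_TEST_TIMEOUT_DEFAULT (180s unless overridden) expires. If anyone
catches a run stuck in this state, the standby's progress can be watched
by hand with something like this (a hedged example; the target LSN
differs per run):

    # Check how far the standby lags, from the primary's point of view;
    # pg_wal_lsn_diff() reports the WAL bytes still to be replayed.
    psql -d postgres -c "
        SELECT application_name, state, replay_lsn,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
        FROM pg_catalog.pg_stat_replication
        WHERE application_name IN ('standby_1', 'walreceiver');"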

Maybe it's a case of the system being overloaded, I dunno, but I wouldn't bet my house on it: pretty much nothing else runs on this machine.

I have added a mild throttle to the buildfarm config so it doesn't try
to run every branch at once. Maybe I need to bring down the number of
meson jobs too? But I suspect there's something deeper going on. Prior
to the failure of this test 10 days ago, it hadn't failed in a very long
time. The OS was upgraded a month ago, and eight or so days ago I
changed PG_TEST_EXTRA. I can't think of anything else that has changed.
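
For reference, the knobs in question look something like this (hedged
examples: the throttle itself lives in the buildfarm client config, and
'ssl ldap' is only an illustrative PG_TEST_EXTRA value, not what crake
actually uses):

    # Cap meson's test-level parallelism (by default it runs one job
    # per CPU, or whatever MESON_TESTTHREADS says):
    meson test -C build --num-processes 4

    # PG_TEST_EXTRA opts in to extra, more resource-hungry suites:
    export PG_TEST_EXTRA='ssl ldap'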

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
