fairywren timeout failures on the pg_amcheck/005_opclass_damage test

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Andrew Dunstan <andrew(at)dunslane(dot)net>
Subject: fairywren timeout failures on the pg_amcheck/005_opclass_damage test
Date: 2024-07-25 12:00:00
Message-ID: b2037a8d-fe6b-d299-da17-ff5f3214e648@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello hackers,

Please take a look at today's failure [1], that raises several questions
at once:
117/244 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade TIMEOUT        3001.48s   exit status 1
180/244 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage TIMEOUT        3001.43s   exit status 1

Ok:                 227
Expected Fail:      0
Fail:               0
Unexpected Pass:    0
Skipped:            15
Timeout:            2

But looking at the previous test run [2], marked 'OK', I can see almost
the same:
115/244 postgresql:pg_upgrade / pg_upgrade/002_pg_upgrade TIMEOUT        3001.54s   exit status 1
176/244 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage TIMEOUT        3001.26s   exit status 1

Ok:                 227
Expected Fail:      0
Fail:               0
Unexpected Pass:    0
Skipped:            15
Timeout:            2

So it's not clear to me, why is the latter test run considered failed
unlike the former?

As to the 005_opclass_damage failure itself, we can find successful test
runs with duration close to 3000s:
[3] 212/242 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage OK             2894.75s   10 subtests passed
[4] 212/242 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage OK             2941.73s   10 subtests passed

(The average duration across 35 successful runs in June was around 1300s;
but in July only 5 test runs were successful.)

The other question is: why is 005_opclass_damage taking so much time there?
Looking at the last three runs in REL_17_STABLE, I see:
 87/243 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage OK               22.80s   10 subtests passed
 87/243 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage OK               19.60s   10 subtests passed
 87/243 postgresql:pg_amcheck / pg_amcheck/005_opclass_damage OK                6.09s   10 subtests passed

I suppose, the most significant factor here is
'HEAD' => [
    'debug_parallel_query = regress'
]
in the buildfarm client's config.

Indeed, it affects timing of the test very much, judging by what I'm
seeing locally, on Linux (I guess, process-related overhead might be
higher on Windows):
$ make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" PROVE_FLAGS="--timer"
[11:11:53] t/005_opclass_damage.pl .. ok     1370 ms ( 0.00 usr 0.00 sys +  0.10 cusr  0.07 csys =  0.17 CPU)

$ echo "debug_parallel_query = regress" >/tmp/extra.config
$ TEMP_CONFIG=/tmp/extra.config make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" PROVE_FLAGS="--timer"
[11:12:46] t/005_opclass_damage.pl .. ok    40854 ms ( 0.00 usr 0.00 sys +  0.10 cusr  0.10 csys =  0.20 CPU)

(Thus we can see 30x duration increase.)

It's worth to note that such increase is rather new (introduced by
5ae208720); in REL_16_STABLE I'm seeing:
$ make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" PROVE_FLAGS="--timer"
[11:18:52] t/005_opclass_damage.pl .. ok     1453 ms ( 0.00 usr 0.00 sys +  0.82 cusr  0.11 csys =  0.93 CPU)

$ TEMP_CONFIG=/tmp/extra.config make -s check -C src/bin/pg_amcheck/ PROVE_TESTS="t/005*" PROVE_FLAGS="--timer"
[11:19:18] t/005_opclass_damage.pl .. ok     8032 ms ( 0.00 usr 0.00 sys +  0.82 cusr  0.11 csys =  0.93 CPU)

So maybe at least this test should be improved for testing with
debug_parallel_query enabled, if such active use of parallel workers by
pg_amcheck can't be an issue to end users?

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-07-25%2003%3A23%3A19
[2] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-07-23%2023%3A01%3A55
[3] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-05-02%2019%3A03%3A08
[4] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=fairywren&dt=2024-04-30%2014%3A03%3A06

Best regards,
Alexander

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Andrew Dunstan 2024-07-25 12:12:38 Re: fairywren timeout failures on the pg_amcheck/005_opclass_damage test
Previous Message px shi 2024-07-25 11:48:01 CREATE MATERIALIZED VIEW