The 031_recovery_conflict.pl test might fail due to late pgstat entries flushing

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>
Subject: The 031_recovery_conflict.pl test might fail due to late pgstat entries flushing
Date: 2024-07-24 03:00:00
Message-ID: 1eb7832d-3a26-ae53-6017-d0864ea908aa@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello hackers,

A recent buildfarm failure [1] shows that the test 031_recovery_conflict.pl
can fail yet another way:
 23/296 postgresql:recovery / recovery/031_recovery_conflict ERROR            11.55s   exit status 1

[07:58:53.979](0.255s) ok 11 - tablespace conflict: logfile contains terminated connection due to recovery conflict
[07:58:54.058](0.080s) not ok 12 - tablespace conflict: stats show conflict on standby
[07:58:54.059](0.000s) #   Failed test 'tablespace conflict: stats show conflict on standby'
#   at /home/bf/bf-build/rorqual/REL_17_STABLE/pgsql/src/test/recovery/t/031_recovery_conflict.pl line 332.
[07:58:54.059](0.000s) #          got: '0'
#     expected: '1'

I managed to reproduce a similar failure by running multiple test instances
in parallel on a slow VM. With extra logging added, I see in a failed run
log:
10
10      #   Failed test 'startup deadlock: stats show conflict on standby'
10      #   at t/031_recovery_conflict.pl line 368.
10      #          got: '0'
10      #     expected: '1'
10      [19:48:19] t/031_recovery_conflict.pl ..
10      Dubious, test returned 1 (wstat 256, 0x100)

2024-07-23 19:48:13.966 UTC [1668402:12][client backend][1/2:0] LOG:  !!!pgstat_report_recovery_conflict| reason: 13,
pgstat_track_counts: 1 at character 15
2024-07-23 19:48:13.966 UTC [1668402:13][client backend][1/2:0] STATEMENT:  SELECT * FROM test_recovery_conflict_table2;
2024-07-23 19:48:13.966 UTC [1668402:14][client backend][1/2:0] ERROR:  canceling statement due to conflict with
recovery at character 15
2024-07-23 19:48:13.966 UTC [1668402:15][client backend][1/2:0] DETAIL:  User transaction caused buffer deadlock with
recovery.
...
2024-07-23 19:48:14.129 UTC [1668805:8][client backend][5/2:0] LOG: statement: SELECT confl_deadlock FROM
pg_stat_database_conflicts WHERE datname='test_db';
...
2024-07-23 19:48:14.148 UTC [1668402:16][client backend][1/0:0] LOG:  !!!pgstat_database_flush_cb| nowait: 0

This failure can be reproduced easily with a sleep added as below:
@@ -514,6 +514,7 @@ pgstat_shutdown_hook(int code, Datum arg)
     if (OidIsValid(MyDatabaseId))
         pgstat_report_disconnect(MyDatabaseId);

+if (rand() % 5 == 0) pg_usleep(100000);
     pgstat_report_stat(true);

By running the test in a loop, I get miscellaneous
"stats show conflict on standby" failures, including:
iteration 19
# +++ tap check in src/test/recovery +++
t/031_recovery_conflict.pl .. 1/?
#   Failed test 'buffer pin conflict: stats show conflict on standby'
#   at t/031_recovery_conflict.pl line 332.
#          got: '0'
#     expected: '1'
t/031_recovery_conflict.pl .. 17/? # Looks like you failed 1 test of 18.

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=rorqual&dt=2024-07-23%2007%3A56%3A35

Best regards,
Alexander

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2024-07-24 03:20:50 Re: Remove dependence on integer wrapping
Previous Message Jeff Davis 2024-07-24 02:26:29 Re: [18] Policy on IMMUTABLE functions and Unicode updates