Re: Add isolation test template in injection_points for wait/wakeup/detach

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Bertrand Drouvot <bertranddrouvot(dot)pg(at)gmail(dot)com>
Cc: Postgres hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Add isolation test template in injection_points for wait/wakeup/detach
Date: 2024-10-25 00:33:12
Message-ID: ZxrnSGdNtQWAxE3_@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Mon, Oct 21, 2024 at 11:13:59AM +0900, Michael Paquier wrote:
> Used "basic" as test name at the end, tweaked the comment close to
> what you have suggested, and applied the result.

I have spotted a race condition with this test in some CF bot runs.
Here are some of them:
https://github.com/michaelpq/postgres/runs/31984161747
https://cirrus-ci.com/task/6741385749987328
https://cirrus-ci.com/task/6056918689513472
https://cirrus-ci.com/task/5863397261049856
https://cirrus-ci.com/task/5565235765968896
https://cirrus-ci.com/task/6238594598174720

All of them show up as follows in regression.diffs:
diff -U3
/tmp/cirrus-ci-build/src/test/modules/injection_points/expected/basic.out
/tmp/cirrus-ci-build/build/testrun/injection_points/isolation/results/basic.out
---
/tmp/cirrus-ci-build/src/test/modules/injection_points/expected/basic.out
2024-10-24 02:36:24.068590000 +0000
+++
/tmp/cirrus-ci-build/build/testrun/injection_points/isolation/results/basic.out
2024-10-24 02:40:09.179695000 +0000
@@ -13,16 +13,16 @@

(1 row)

-step wait1: <... completed>
-injection_points_run
---------------------
-
-(1 row)
-
step detach2: SELECT injection_points_detach('injection-points-wait');
injection_points_detach
-----------------------

+(1 row)
+
+step wait1: <... completed>
+injection_points_run
+--------------------
+
(1 row)

This is in the first permutation of the test done with "wait1 wakeup2
detach2", and the diff means that the backend running the "wait"
callback is reported as finished after the detach is done,
injection_points_run being only used for the waits. Hence the wait is
so slow to finish that the detach has time to complete and finish,
breaking the output.

And here comes the puzzling part: all of failures involve FreeBSD 13
in the CI. Reproducing this failure would not be difficult, I thought
first; we can add a hardcoded pg_usleep() to delay the end of
injection_wait() so as we make sure that the backend doing the wait
reports later than the detach. Or just do the same at the end of
injection_points_run() once the callback exits. I've sure done that,
placing some strategic pg_usleep() calls on Linux to make the paths
that matter in the wait slower, but the test remains stable. The CI
on Linux is stable as well: 3rd and 4th columns of the cfbot are
green, I did not spot any failures related to this isolation test in
injection_points. Only the second column about FreeBSD is going rogue
on a periodic basis.

One fix would is to remove this first permutation test, still that's
hiding a problem rather than solving it, and something looks wrong
with conditional variables, specific to FreeBSD? Hence, I'm puzzled
by that.

I am going to remove the first permutation anyway to stabilize the CI
reports. But it feels like we have a different problem here, and I am
not sure what.

Any thoughts or comments?
--
Michael

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Melanie Plageman 2024-10-25 00:33:29 Docs Build in CI failing with "failed to load external entity"
Previous Message Paul Jungwirth 2024-10-25 00:22:32 Re: SQL:2011 application time