Re: stats test intermittent failure

From: Alexander Lakhin <exclusion(at)gmail(dot)com>
To: Melanie Plageman <melanieplageman(at)gmail(dot)com>, Pg Hackers <pgsql-hackers(at)postgresql(dot)org>
Cc: Jeff Davis <pgsql(at)j-davis(dot)com>, Andres Freund <andres(at)anarazel(dot)de>
Subject: Re: stats test intermittent failure
Date: 2023-07-11 18:00:00
Message-ID: 8d526f29-c269-dc3d-38ac-10fa3a0a7fb3@gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hi Melanie,

10.07.2023 21:35, Melanie Plageman wrote:
> Hi,
>
> Jeff pointed out that one of the pg_stat_io tests has failed a few times
> over the past months (here on morepork [1] and more recently here on
> francolin [2]).
>
> Failing test diff for those who prefer not to scroll:
>
> +++ /home/bf/bf-build/francolin/HEAD/pgsql.build/testrun/recovery/027_stream_regress/data/results/stats.out
> 2023-07-07 18:48:25.976313231 +0000
> @@ -1415,7 +1415,7 @@
> :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
> ?column? | ?column?
> ----------+----------
> - t | t
> + t | f
>
> My theory about the test failure is that, when there is enough demand
> for shared buffers, the flapping test fails because it expects buffer
> access strategy *reuses* and concurrent queries already flushed those
> buffers before they could be reused. Attached is a patch which I think
> will fix the test while keeping some code coverage. If we count
> evictions and reuses together, those should have increased.

I managed to reproduce that failure with the attached patch applied
(on master) and with the following script (that effectively multiplies
probability of the failure by 360):
CPPFLAGS="-O0" ./configure -q --enable-debug --enable-cassert --enable-tap-tests  && make  -s -j`nproc` && make -s check
-C src/test/recovery
mkdir -p src/test/recovery00/t
cp src/test/recovery/t/027_stream_regress.pl src/test/recovery00/t/
cp src/test/recovery/Makefile src/test/recovery00/
for ((i=1;i<=9;i++)); do cp -r src/test/recovery00/ src/test/recovery$i; done

for ((i=1;i<=10;i++)); do echo "iteration $i"; NO_TEMP_INSTALL=1 parallel --halt now,fail=1 -j9 --linebuffer --tag make
-s check -C src/test/{} ::: recovery1 recovery2 recovery3 recovery4 recovery5 recovery6 recovery7 recovery8 recovery9 ||
break; done

Without your patch, I get:
iteration 2
...
recovery5       #   Failed test 'regression tests pass'
recovery5       #   at t/027_stream_regress.pl line 92.
recovery5       #          got: '256'
recovery5       #     expected: '0'
...
src/test/recovery5/tmp_check/log/regress_log_027_stream_regress contains:
--- .../src/test/regress/expected/stats.out  2023-07-11 20:05:10.536059706 +0300
+++ .../src/test/recovery5/tmp_check/results/stats.out 2023-07-11 20:30:46.790551305 +0300
@@ -1418,7 +1418,7 @@
        :io_sum_vac_strategy_after_reuses > :io_sum_vac_strategy_before_reuses;
  ?column? | ?column?
 ----------+----------
- t        | t
+ t        | f
 (1 row)

With your patch applied, 10 iterations performed successfully for me.
So it looks like your theory and your fix are correct.

Best regards,
Alexander

Attachment Content-Type Size
regress-repeat-stats-line.patch text/x-patch 18.2 KB

In response to

Browse pgsql-hackers by date

  From Date Subject
Next Message Alena Rybakina 2023-07-11 18:11:31 Re: POC, WIP: OR-clause support for indexes
Previous Message Jacob Champion 2023-07-11 17:50:28 Re: [PoC] Federated Authn/z with OAUTHBEARER