Re: New instability in stats regression test

From: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Cc: Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz>
Subject: Re: New instability in stats regression test
Date: 2023-11-25 19:34:40
Message-ID: 3510625.1700940880@sss.pgh.pa.us
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

I wrote:
> I'm a bit mystified by this. This test was introduced in Andres'
> commit 10a082bf7 of 2023-02-11, and it seems to have been stable
> since then. I trawled the buildfarm logs going back three months
> and found no similar failures. So why's it failing now? The
> most plausible theory seems to be that Michael's recent commits
> adding pg_stat_reset_xxx features destabilized the test somehow ...
> but I sure don't see how/why.

After a bit more looking around, I have part of a theory.
Commit 23c8c0c8f of 2023-11-12 added this, a little ways before
the problematic test:

-- Test that reset_shared with no argument resets all the stats types
-- supported (providing NULL as argument has the same effect).
SELECT pg_stat_reset_shared();

The test that is failing is of course

-- Test IO stats reset
SELECT pg_stat_have_stats('io', 0, 0);
SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_pre_reset
FROM pg_stat_io \gset
SELECT pg_stat_reset_shared('io');
SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_post_reset
FROM pg_stat_io \gset
SELECT :io_stats_post_reset < :io_stats_pre_reset;

So the observed failure could be explained if, between the
"pg_stat_reset_shared('io')" call and the subsequent scan of
pg_stat_io, concurrent sessions had done more I/O operations
than happened since that new pg_stat_reset_shared() call.
Previously, the "pre_reset" counts would be large enough to
make that a pretty ridiculous theory, but after 23c8c0c8f maybe
it's not.

To test this idea, I made the test print out the actual values
of the counts, like this:

@@ -1585,10 +1585,10 @@

SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_post_reset
FROM pg_stat_io \gset
-SELECT :io_stats_post_reset < :io_stats_pre_reset;
- ?column?
-----------
- t
+SELECT :io_stats_post_reset, :io_stats_pre_reset;
+ ?column? | ?column?
+----------+----------
+ 10452 | 190087
(1 row)

Of course, this makes it fail every time, but the idea is to get
a sense of the magnitude of the counts; and what I'm seeing is
that the "pre reset" counts are typically 10x more than the
"post reset" ones, even after 23c8c0c8f. If I remove the
suspicious pg_stat_reset_shared() call, there's about 3 orders
of magnitude difference; but still you'd think a 10x safety
margin would be enough. So this theory doesn't seem to quite
work as-is. Perhaps there's some additional contributing factor
I didn't think to control.

Nonetheless, it seems like a really bad idea that this test
of I/O stats reset happens after the newly-added test. It
is clearly now dependent on timing and the amount of concurrent
activity whether it will pass or not. We should probably
re-order the tests to do the old test first; or else abandon
this test methodology and just test I/O reset the same way
we test the other cases (checking only for timestamp advance).
Or maybe we don't really need the pg_stat_reset_shared() test?

regards, tom lane

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Alexander Lakhin 2023-11-25 20:00:01 Test 002_pg_upgrade fails with olddump on Windows
Previous Message Davin Shearer 2023-11-25 19:21:37 Emitting JSON to file using COPY TO