From: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us> |
---|---|
To: | pgsql-hackers(at)lists(dot)postgresql(dot)org |
Cc: | Andres Freund <andres(at)anarazel(dot)de>, Michael Paquier <michael(at)paquier(dot)xyz> |
Subject: | Re: New instability in stats regression test |
Date: | 2023-11-25 19:34:40 |
Message-ID: | 3510625.1700940880@sss.pgh.pa.us |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
I wrote:
> I'm a bit mystified by this. This test was introduced in Andres'
> commit 10a082bf7 of 2023-02-11, and it seems to have been stable
> since then. I trawled the buildfarm logs going back three months
> and found no similar failures. So why's it failing now? The
> most plausible theory seems to be that Michael's recent commits
> adding pg_stat_reset_xxx features destabilized the test somehow ...
> but I sure don't see how/why.
After a bit more looking around, I have part of a theory.
Commit 23c8c0c8f of 2023-11-12 added this, a little ways before
the problematic test:
-- Test that reset_shared with no argument resets all the stats types
-- supported (providing NULL as argument has the same effect).
SELECT pg_stat_reset_shared();
The test that is failing is of course
-- Test IO stats reset
SELECT pg_stat_have_stats('io', 0, 0);
SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_pre_reset
FROM pg_stat_io \gset
SELECT pg_stat_reset_shared('io');
SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_post_reset
FROM pg_stat_io \gset
SELECT :io_stats_post_reset < :io_stats_pre_reset;
So the observed failure could be explained if, between the
"pg_stat_reset_shared('io')" call and the subsequent scan of
pg_stat_io, concurrent sessions had done more I/O operations
than happened since that new pg_stat_reset_shared() call.
Previously, the "pre_reset" counts would be large enough to
make that a pretty ridiculous theory, but after 23c8c0c8f maybe
it's not.
To test this idea, I made the test print out the actual values
of the counts, like this:
@@ -1585,10 +1585,10 @@
SELECT sum(evictions) + sum(reuses) + sum(extends) + sum(fsyncs) + sum(reads) + sum(writes) + sum(writebacks) + sum(hits) AS io_stats_post_reset
FROM pg_stat_io \gset
-SELECT :io_stats_post_reset < :io_stats_pre_reset;
- ?column?
-----------
- t
+SELECT :io_stats_post_reset, :io_stats_pre_reset;
+ ?column? | ?column?
+----------+----------
+ 10452 | 190087
(1 row)
Of course, this makes it fail every time, but the idea is to get
a sense of the magnitude of the counts; and what I'm seeing is
that the "pre reset" counts are typically 10x more than the
"post reset" ones, even after 23c8c0c8f. If I remove the
suspicious pg_stat_reset_shared() call, there's about 3 orders
of magnitude difference; but still you'd think a 10x safety
margin would be enough. So this theory doesn't seem to quite
work as-is. Perhaps there's some additional contributing factor
I didn't think to control.
Nonetheless, it seems like a really bad idea that this test
of I/O stats reset happens after the newly-added test. It
is clearly now dependent on timing and the amount of concurrent
activity whether it will pass or not. We should probably
re-order the tests to do the old test first; or else abandon
this test methodology and just test I/O reset the same way
we test the other cases (checking only for timestamp advance).
Or maybe we don't really need the pg_stat_reset_shared() test?
regards, tom lane
From | Date | Subject | |
---|---|---|---|
Next Message | Alexander Lakhin | 2023-11-25 20:00:01 | Test 002_pg_upgrade fails with olddump on Windows |
Previous Message | Davin Shearer | 2023-11-25 19:21:37 | Emitting JSON to file using COPY TO |