From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Subject: Re: Recent 027_streaming_regress.pl hangs
Date: 2024-03-26 03:56:07
Message-ID: 20240326035607.grqoyrxjvpyhnkrf@awork3.anarazel.de
Lists: pgsql-hackers
Hi,
On 2024-03-20 17:41:45 -0700, Andres Freund wrote:
> On 2024-03-14 16:56:39 -0400, Tom Lane wrote:
> > Also, this is probably not
> > helping anything:
> >
> > 'extra_config' => {
> > ...
> > 'fsync = on'
>
> At some point we had practically no test coverage of fsync, so I made my
> animals use fsync. I think we still have little coverage. I probably could
> reduce the number of animals using it though.
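For context, the buildfarm client lets an animal inject settings like this into the test clusters it creates. A minimal sketch of how that might look in an animal's build-farm.conf, assuming the client's usual extra_config conventions (the DEFAULT branch key and exact layout are an illustration, not taken from this thread):

```perl
# Hypothetical fragment of a buildfarm animal's build-farm.conf.
# extra_config maps a branch name (or DEFAULT for all branches) to a
# list of postgresql.conf lines appended to every test cluster's config.
extra_config => {
    DEFAULT => [
        q(fsync = on),    # exercise fsync paths, which otherwise see little coverage
    ],
},
```

Dropping the setting from some animals' DEFAULT lists would be the "reduce the number of animals using it" option mentioned above.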
I think there must be some actual regression involved. The frequency of
failures on HEAD vs. failures on 16 - both of which run the tests concurrently
via meson - is just vastly different. I'd expect the absolute number of
failures in 027_stream_regress.pl to differ between branches due to fewer runs
on 16, but that doesn't explain the difference in the percentage of
failures. My menagerie had only a single recoveryCheck failure on !HEAD in the
last 30 days, but in the vicinity of 100 on HEAD:
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=30&stage=recoveryCheck&filter=Submit

If anything, the load when testing back-branch changes is higher, because
back-branch builds commonly happen on all branches at once, so I don't think
that can be the explanation either.
From what I can tell, the pattern changed on 2024-02-16 19:39:02 - there was a
rash of recoveryCheck failures in the days before that too, but not in
027_stream_regress.pl in that way.

It certainly seems suspicious that one commit before the first observed failure
is

2024-02-16 11:09:11 -0800 [73f0a132660] Pass correct count to WALRead().

Of course the failure rate is low enough that the culprit could also have
landed a day or two earlier.
Greetings,
Andres Freund