Re: Recent 027_streaming_regress.pl hangs

From: Andres Freund <andres(at)anarazel(dot)de>
To: Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>
Cc: Thomas Munro <thomas(dot)munro(at)gmail(dot)com>, Alexander Lakhin <exclusion(at)gmail(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, pgsql-hackers <pgsql-hackers(at)postgresql(dot)org>, Bharath Rupireddy <bharath(dot)rupireddyforpostgres(at)gmail(dot)com>
Subject: Re: Recent 027_streaming_regress.pl hangs
Date: 2024-03-26 03:56:07
Message-ID: 20240326035607.grqoyrxjvpyhnkrf@awork3.anarazel.de
Lists: pgsql-hackers

Hi,

On 2024-03-20 17:41:45 -0700, Andres Freund wrote:
> On 2024-03-14 16:56:39 -0400, Tom Lane wrote:
> > Also, this is probably not
> > helping anything:
> >
> > 'extra_config' => {
> > ...
> > 'fsync = on'
>
> At some point we had practically no test coverage of fsync, so I made my
> animals use fsync. I think we still have little coverage. I probably could
> reduce the number of animals using it though.

I think there must be some actual regression involved. The frequency of
failures on HEAD vs failures on 16 - both of which run the tests concurrently
via meson - is just vastly different. I'd expect the absolute number of
failures in 027_stream_regress.pl to differ between branches due to fewer runs
on 16, but there's no explanation for the difference in percentage of
failures. My menagerie had only a single recoveryCheck failure on !HEAD in the
last 30 days, but in the vicinity of 100 on HEAD:
https://buildfarm.postgresql.org/cgi-bin/show_failures.pl?max_days=30&stage=recoveryCheck&filter=Submit

If anything, the load when testing back-branch changes is higher, because
such changes commonly trigger builds on all branches at once, so I don't think
load can be the explanation either.

From what I can tell the pattern changed on 2024-02-16 19:39:02 - there was a
rash of recoveryCheck failures in the days before that too, but they did not
involve 027_stream_regress.pl in the same way.

It certainly seems suspicious that one commit before the first observed failure
is
2024-02-16 11:09:11 -0800 [73f0a132660] Pass correct count to WALRead().

Of course the failure rate is low enough that it could have been a day or two
before that, too.

Greetings,

Andres Freund
