From: | Noah Misch <noah(at)leadboat(dot)com> |
---|---|
To: | Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Nathan Bossart <nathandbossart(at)gmail(dot)com>, Andres Freund <andres(at)anarazel(dot)de>, Robert Pang <robertpang(at)google(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Back-patch of: avoid multiple hard links to same WAL file after a crash |
Date: | 2025-04-13 15:33:12 |
Message-ID: | 20250413153312.12.nmisch@google.com |
Lists: | pgsql-hackers |

On Sat, Apr 05, 2025 at 07:09:58PM -0700, Noah Misch wrote:
> On Sun, Apr 06, 2025 at 07:42:02AM +0900, Michael Paquier wrote:
> > On Sat, Apr 05, 2025 at 12:13:39PM -0700, Noah Misch wrote:
> > > Since the 2025-02 releases made non-toy-size archive recoveries fail easily,
> > > that's not enough. If the proposed 3-second test is the wrong thing, what
> > > instead?
> >
> > I don't have a good idea about that in ~16, TBH, but I am sure to not
> > be a fan of the low reproducibility rate of this test as proposed.
> > It's not perfect, but as the design to fix the original race condition
> > has been introduced in v15, why not begin with a test in 17~ using
> > some injection points?
>
> Two reasons:
>
> a) The fix ended calls to the whole range of relevant code. Hence, the
> injection point placement that would have been relevant before the fix
> isn't reached. In other words, there's no right place for the injection
> point. (The place for the injection point would be in durable_rename(), in
> the checkpointer. After the fix, the checkpointer just doesn't call
> durable_rename().)
>
> b) Stochastic tests catch defects beyond the specific one the test author
> targeted. An injection point test is less likely to do that. (That said,
> with reason (a), there's no known injection point test design to compete
> with the stochastic design.)

Tom and Michael, do you still object to adding the test? If there are no new
or renewed objections by 2025-04-20, I'll proceed to add it.

As another data point, raising the runtime from 3s to 17s makes it reproduce
the problem 25% of the time. You can imagine a plot with axes of runtime and
percent detection. One can pick any point on that plot's curve. Given how
little wall time it takes for the buildfarm and CI to reach a few hundred
runs, I like the trade-off of 3s runtime and 1% detection. In particular, I
like it better than 17s runtime for 25% detection. How do you see it?
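
To make the two points on that curve easier to compare, here is a rough sketch of the arithmetic. It is not part of the proposed test. It takes the 3s/1% and 17s/25% figures from this message and assumes that runs are independent and that each run detects the defect at a fixed rate; the function names and the 95% target are illustrative choices only.

```python
import math


def detection_probability(per_run_rate: float, runs: int) -> float:
    """Chance that at least one of `runs` independent runs detects the defect."""
    return 1.0 - (1.0 - per_run_rate) ** runs


def runs_for_confidence(per_run_rate: float, confidence: float) -> int:
    """Smallest number of independent runs reaching `confidence` detection."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - per_run_rate))


# (runtime in seconds, per-run detection rate), as quoted in the message.
OPTIONS = {"3s": (3.0, 0.01), "17s": (17.0, 0.25)}


def summarize(confidence: float = 0.95) -> dict:
    """Map each option to (runs needed, total test seconds across those runs)."""
    summary = {}
    for name, (runtime, rate) in OPTIONS.items():
        runs = runs_for_confidence(rate, confidence)
        summary[name] = (runs, runs * runtime)
    return summary


if __name__ == "__main__":
    for name, (runs, seconds) in summarize().items():
        print(f"{name}: {runs} runs, {seconds:.0f}s of test time for 95% detection")
```

Under those assumptions, a few hundred runs of the 3s variant already give about a 95% chance of at least one detection. The sketch counts only summed test seconds. It ignores per-run fixed overhead, how quickly the buildfarm and CI accumulate runs, and the per-run cost borne by every developer, which are the factors the trade-off above weighs.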