Re: Back-patch of: avoid multiple hard links to same WAL file after a crash

From: Noah Misch <noah(at)leadboat(dot)com>
To: Michael Paquier <michael(at)paquier(dot)xyz>
Cc: Andres Freund <andres(at)anarazel(dot)de>, Nathan Bossart <nathandbossart(at)gmail(dot)com>, Robert Pang <robertpang(at)google(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org
Subject: Re: Back-patch of: avoid multiple hard links to same WAL file after a crash
Date: 2025-03-06 19:30:13
Message-ID: 20250306193013.36.nmisch@google.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Thu, Dec 19, 2024 at 02:44:53PM +0900, Michael Paquier wrote:
> On Wed, Dec 18, 2024 at 08:51:20PM -0500, Andres Freund wrote:
> > I don't think the issue is actually quite as unlikely to be hit as reasoned in
> > the commit message. The crash has indeed to happen between the link() and
> > unlink() - but at the end of a checkpoint we do that operations hundreds of
> > times in a row on a busy server. And that's just after potentially doing lots
> > of write IO during a checkpoint, filling up drive write caches / eating up
> > IOPS/bandwidth disk quots.
>
> Looks so, yep. Your timing and the report's timing are interesting.
>
> I've been double-checking the code to refresh myself with the problem,
> and I don't see a reason to not apply something like the attached set
> down to v13 for all these remaining branches (minus an edit of the
> commit message).

This became v14 commit 1f95181b44c843729caaa688f74babe9403b5850. I think it
is causing timing-dependent failures in archive recovery, at restartpoints.
v14 and earlier were relying on the "excl" part of durable_rename_excl() to
keep WAL preallocation and recycling from stomping on archive-restored WAL
files. If that's right, we need to either back-patch v15's suppression of
those features during archive recovery (commit cc2c7d6 and five others), or we
need to revert and fix the multiple hard links differently.

v15 commit cc2c7d6 "Skip WAL recycling and preallocation during archive
recovery" wrote that it fixed "a spurious durable_rename_excl() LOG message".
Replacing durable_rename_excl() with durable_rename() increased consequences
beyond spurious messages. I regret that I didn't make that connection during
post-commit review of commit 1f95181b44c843729caaa688f74babe9403b5850.
Trouble emerges during archive recovery, not during streaming or crash
recovery. The symptom is "invalid magic number 0000 in log segment X, offset
0" when PreallocXlogFiles() stomps. The symptom is "unexpected pageaddr X in
log segment Y, offset 0" [X < Y] when RemoveOldXlogFiles() recycling stomps.
wal_recycle=off probably avoids the latter but not the former.

Options I see:

1. Make v14 and v13 skip WAL recycling and preallocation during archive
recovery, like newer branches do. I think that means back-patching the six
commits cc2c7d6~4 cc2c7d6~3 cc2c7d6~2 cc2c7d6~1 cc2c7d6 e36cbef.

2. Revert 1f95181b44c843729caaa688f74babe9403b5850 and its v13 counterpart.
Avoid multiple hard links by making durable_rename_excl() first rename
oldfile to a temporary name. (I haven't thought this through in detail.
It may not suffice.)

I'm leaning toward (1) at least enough to see how messy the back-patch would
be, since I don't like risks of designing old-branch-specific solutions when
v15/v16/v17 have a proven solution. What else should we be thinking about
before deciding?

Arun Thirupathi collected the symptoms, traced them to commit
1f95181b44c843729caaa688f74babe9403b5850, and reported the diagnosis to me.
These symptoms have not emerged in v15+.

Thanks,
nm

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Nathan Bossart 2025-03-06 19:33:13 Re: Statistics Import and Export
Previous Message Andres Freund 2025-03-06 19:28:20 Re: what's going on with lapwing?