From: | Andres Freund <andres(at)anarazel(dot)de> |
---|---|
To: | Nathan Bossart <nathandbossart(at)gmail(dot)com> |
Cc: | Robert Pang <robertpang(at)google(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org |
Subject: | Re: Back-patch of: avoid multiple hard links to same WAL file after a crash |
Date: | 2024-12-19 01:51:20 |
Message-ID: | 7chjz7zeigbsbt7nim4cj6zryflzcpy2lgenas3yh7cpvwf3gb@m6cwqzlvfd6q |
Lists: | pgsql-hackers |
Hi,
On 2024-12-18 10:38:19 -0600, Nathan Bossart wrote:
> On Tue, Dec 17, 2024 at 04:50:16PM -0800, Robert Pang wrote:
> > We recently observed a few cases where Postgres running on Linux
> > encountered an issue with WAL segment files. Specifically, two WAL
> > segments were linked to the same physical file after Postgres ran out
> > of memory and the OOM killer terminated one of its processes. This
> > resulted in the WAL segments overwriting each other and Postgres
> > failing a later recovery.
>
> Yikes!
Indeed. As chance would have it, I was asked for input on a corrupted server
*today*. Eventually we found that recovery stopped early, after encountering a
segment with a *newer* pageaddr than we expected. Which made me think of this
issue, and indeed, the file recovery stopped at had two links. Before that
the server had been crashing on a regular basis for unrelated reasons, which
presumably increased the chances sufficiently to eventually hit this problem.

It's a normal thing to discover the end of the WAL by finding a segment that
has an older pageaddr than its name suggests. But in this case we saw a newer
page address. I wonder if we should treat that differently...
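
To make the older/newer distinction concrete, here is a minimal sketch in Python. It is not PostgreSQL's recovery code. It assumes the default 16MB segment size and the usual 24-hex-digit segment file name, takes the first page's address as a plain integer instead of reading a page header, and the function names and the three labels are invented for illustration.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default 16MB segments


def segment_start_lsn(filename, seg_size=WAL_SEGMENT_SIZE):
    """Byte position at which a segment should start, derived from its name."""
    if len(filename) != 24:
        raise ValueError("expected a 24-character WAL segment name")
    # Name layout: 8 hex digits timeline, 8 hex digits "log", 8 hex digits "seg".
    log = int(filename[8:16], 16)
    seg = int(filename[16:24], 16)
    segs_per_log = 0x100000000 // seg_size
    return (log * segs_per_log + seg) * seg_size


def classify_first_page(filename, pageaddr, seg_size=WAL_SEGMENT_SIZE):
    """Compare the first page's address with what the file name implies."""
    expected = segment_start_lsn(filename, seg_size)
    if pageaddr == expected:
        return "match"   # segment holds the WAL its name promises
    if pageaddr < expected:
        return "older"   # recycled segment not yet overwritten: usual end of WAL
    return "newer"       # the case described above: worth reporting differently
```

With these assumptions, "older" is the ordinary end-of-WAL signal, while "newer" would be the hint that the file may be reachable under more than one name.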
> > We found this fix [1] that has been applied to Postgres 16, but the
> > cases we observed were running Postgres 15. Given that older major
> > versions will be supported for a good number of years, and the
> > potential for irrecoverability exists (even if rare), we would like to
> > discuss the possibility of back-patching this fix.
>
> IMHO this is a good time to reevaluate. It looks like we originally didn't
> back-patch out of an abundance of caution, but now that this one has had
> time to bake, I think it's worth seriously considering, especially now that
> we have a report from the field.
Strongly agreed.
I don't think the issue is actually quite as unlikely to be hit as reasoned in
the commit message. The crash does indeed have to happen between the link() and
unlink() - but at the end of a checkpoint we do those operations hundreds of
times in a row on a busy server. And that's just after potentially doing lots
of write IO during a checkpoint, filling up drive write caches / eating up
IOPS/bandwidth disk quotas.
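
For illustration, the window between link() and unlink() can be reproduced outside of Postgres. The following is a rough sketch in Python, assuming a Linux filesystem with hard-link support. The helper names and file names are hypothetical, and the "crash" is simulated by simply returning early. It shows how an interrupted link-then-unlink leaves two names for one file, and how such files could be spotted via their link count.

```python
import os
import tempfile


def recycle_with_crash(old_path, new_path, crash_before_unlink):
    """Rename via link() + unlink(); optionally stop in between, as a crash would."""
    os.link(old_path, new_path)   # new name now points at the same inode
    if crash_before_unlink:
        return                    # simulated crash: the old name is never removed
    os.unlink(old_path)


def multiply_linked(directory):
    """Return sorted names of regular files in directory with more than one hard link."""
    names = []
    for entry in os.scandir(directory):
        if not entry.is_file(follow_symlinks=False):
            continue
        # st_nlink is meaningful here on Linux; other platforms may differ.
        if entry.stat(follow_symlinks=False).st_nlink > 1:
            names.append(entry.name)
    return sorted(names)


def demo():
    """Simulate one interrupted recycle and report what is left behind."""
    with tempfile.TemporaryDirectory() as d:
        old = os.path.join(d, "segment_old")
        new = os.path.join(d, "segment_new")
        with open(old, "wb") as f:
            f.write(b"\0" * 8192)
        recycle_with_crash(old, new, crash_before_unlink=True)
        return multiply_linked(d), os.path.samefile(old, new)
```

Running such a check over a WAL directory after a crash would be one way to notice the state described above, since any write through one name also changes the contents seen through the other.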
Greetings,
Andres Freund