| From: | Andres Freund <andres(at)anarazel(dot)de> | 
|---|---|
| To: | Nathan Bossart <nathandbossart(at)gmail(dot)com> | 
| Cc: | Robert Pang <robertpang(at)google(dot)com>, Michael Paquier <michael(at)paquier(dot)xyz>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org | 
| Subject: | Re: Back-patch of: avoid multiple hard links to same WAL file after a crash | 
| Date: | 2024-12-19 01:51:20 | 
| Message-ID: | 7chjz7zeigbsbt7nim4cj6zryflzcpy2lgenas3yh7cpvwf3gb@m6cwqzlvfd6q | 
| Lists: | pgsql-hackers | 
Hi,
On 2024-12-18 10:38:19 -0600, Nathan Bossart wrote:
> On Tue, Dec 17, 2024 at 04:50:16PM -0800, Robert Pang wrote:
> > We recently observed a few cases where Postgres running on Linux
> > encountered an issue with WAL segment files. Specifically, two WAL
> > segments were linked to the same physical file after Postgres ran out
> > of memory and the OOM killer terminated one of its processes. This
> > resulted in the WAL segments overwriting each other and Postgres
> > failing a later recovery.
>
> Yikes!

Indeed.  As chance would have it, I was asked for input on a corrupted server
*today*. Eventually we found that recovery stopped early, after encountering a
segment with a *newer* pageaddr than we expected. Which made me think of this
issue, and indeed, the file recovery stopped at had two links.  Before that
the server had been crashing on a regular basis for unrelated reasons, which
presumably increased the chances sufficiently to eventually hit this problem.
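
A rough way to check a WAL directory for this condition is to group the files by inode and report any inode that has more than one name. The sketch below is only an illustration and not part of PostgreSQL; the function name `find_shared_inodes` is hypothetical, and it assumes a POSIX file system where `st_ino` is meaningful.

```python
import os
from collections import defaultdict


def find_shared_inodes(directory):
    """Return sorted groups of file names in `directory` that share an inode.

    Hypothetical helper: two names in the same group are hard links to the
    same physical file.
    """
    by_inode = defaultdict(list)
    for entry in os.scandir(directory):
        # Only regular files matter here; skip directories and symlinks.
        if not entry.is_file(follow_symlinks=False):
            continue
        st = entry.stat(follow_symlinks=False)
        # (device, inode) identifies the physical file.
        by_inode[(st.st_dev, st.st_ino)].append(entry.name)
    return sorted(sorted(names) for names in by_inode.values() if len(names) > 1)
```

Calling `find_shared_inodes` on a copy of `pg_wal` would return an empty list on a healthy directory, and one group per doubly linked segment otherwise.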
It's a normal thing to discover the end of the WAL by finding a segment that
has an older pageaddr than its name suggests. But in this case we saw a newer
page address.  I wonder if we should treat that differently...
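
To make the older/newer distinction concrete, here is a minimal sketch that compares the page address found in a segment's first page with the address implied by the file name. It is an illustration only, not the check recovery actually performs. It assumes the default 16MB segment size and takes the page address as an input rather than parsing the page header; the function names are hypothetical.

```python
WAL_SEGMENT_SIZE = 16 * 1024 * 1024  # assumption: default segment size


def segment_start_address(file_name, segment_size=WAL_SEGMENT_SIZE):
    """Start address implied by a 24-hex-digit WAL segment file name.

    The name is timeline (8 digits), log id (8 digits), segment within
    that log id (8 digits).
    """
    log_id = int(file_name[8:16], 16)
    seg_in_log = int(file_name[16:24], 16)
    return (log_id << 32) + seg_in_log * segment_size


def classify_first_page(file_name, found_pageaddr, segment_size=WAL_SEGMENT_SIZE):
    """Compare the page address read from the file with the expected one."""
    expected = segment_start_address(file_name, segment_size)
    if found_pageaddr == expected:
        return "match"
    if found_pageaddr < expected:
        # Typical for a recycled segment that was never rewritten:
        # this is how the end of WAL is normally discovered.
        return "older"
    # The file holds data for a later position than its name says,
    # which should not happen in normal operation.
    return "newer"
```

With these assumptions, `classify_first_page("000000010000000000000002", 0x1000000)` returns `"older"`, while passing `0x5000000` for the same file returns `"newer"`.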
> > We found this fix [1] that has been applied to Postgres 16, but the
> > cases we observed were running Postgres 15. Given that older major
> > versions will be supported for a good number of years, and the
> > potential for irrecoverability exists (even if rare), we would like to
> > discuss the possibility of back-patching this fix.
>
> IMHO this is a good time to reevaluate.  It looks like we originally didn't
> back-patch out of an abundance of caution, but now that this one has had
> time to bake, I think it's worth seriously considering, especially now that
> we have a report from the field.

Strongly agreed.

I don't think the issue is actually quite as unlikely to be hit as reasoned in
the commit message.  The crash does indeed have to happen between the link() and
unlink() - but at the end of a checkpoint we do those operations hundreds of
times in a row on a busy server.  And that's just after potentially doing lots
of write IO during a checkpoint, filling up drive write caches / eating up
IOPS/bandwidth disk quotas.
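
The window can be illustrated with a small simulation: create the new name with a hard link, skip the unlink as if the process had been killed, and observe that a write through one name shows up under the other. This is a sketch of the hazard only, not PostgreSQL's code; the file names and the function `simulate_interrupted_link_unlink` are hypothetical.

```python
import os


def simulate_interrupted_link_unlink(directory):
    """Simulate a crash between link() and unlink() inside `directory`.

    Returns (link count of the file, bytes read via the old name after
    writing through the new name).
    """
    old = os.path.join(directory, "recycled_segment")  # hypothetical name
    new = os.path.join(directory, "future_segment")    # hypothetical name
    with open(old, "wb") as f:
        f.write(b"old contents")

    # Step 1: give the file its new name.
    os.link(old, new)
    # A crash here means step 2, os.unlink(old), never runs,
    # so both names keep pointing at the same physical file.
    link_count = os.stat(new).st_nlink

    # Writing through one name silently changes what the other name reads.
    with open(new, "r+b") as f:
        f.write(b"NEW")
    with open(old, "rb") as f:
        seen_via_old = f.read()
    return link_count, seen_via_old
```

Run against an empty temporary directory, this returns a link count of 2 and `b"NEW contents"` when reading through the old name, which is the same kind of overwrite described in the original report.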
Greetings,
Andres Freund