From: | Robert Pang <robertpang(at)google(dot)com> |
---|---|
To: | Michael Paquier <michael(at)paquier(dot)xyz> |
Cc: | Nathan Bossart <nathandbossart(at)gmail(dot)com>, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>, Tom Lane <tgl(at)sss(dot)pgh(dot)pa(dot)us>, robertmhaas(at)gmail(dot)com, pgsql-hackers(at)postgresql(dot)org |
Subject: | Back-patch of: avoid multiple hard links to same WAL file after a crash |
Date: | 2024-12-18 00:50:16 |
Message-ID: | CAJhEC04tBkYPF4q2uS_rCytauvNEVqdBAzasBEokfceFhF=KDQ@mail.gmail.com |
Lists: | pgsql-hackers |
Dear team,
We recently observed a few cases where Postgres running on Linux
encountered an issue with WAL segment files. Specifically, two WAL
segments were linked to the same physical file after Postgres ran out
of memory and the OOM killer terminated one of its processes. This
resulted in the WAL segments overwriting each other and Postgres
failing a later recovery.
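For illustration, here is a minimal sketch of how such a state could be detected on disk. It assumes the duplicate names come from a link-then-unlink style rename that was interrupted between the two steps. That mechanism is an assumption for the purposes of the example, not something confirmed by the cases above. The helper names find_shared_wal_links and simulate_interrupted_link_rename are hypothetical and are not part of Postgres.

```python
import os
from collections import defaultdict


def find_shared_wal_links(wal_dir):
    """Return sorted groups of file names in wal_dir that share one inode.

    Two directory entries with the same (device, inode) pair are hard links
    to the same physical file, so writing to one overwrites the other.
    """
    by_inode = defaultdict(list)
    with os.scandir(wal_dir) as entries:
        for entry in entries:
            # Only look at regular files; skip subdirectories and symlinks.
            if not entry.is_file(follow_symlinks=False):
                continue
            st = os.lstat(entry.path)
            by_inode[(st.st_dev, st.st_ino)].append(entry.name)
    return sorted(sorted(names) for names in by_inode.values() if len(names) > 1)


def simulate_interrupted_link_rename(wal_dir, old_name, new_name):
    """Hypothetical reproduction: link() succeeds, then the process dies.

    A link-then-unlink rename would call os.unlink() on the old name next.
    Skipping that step models a process killed between the two calls.
    """
    os.link(os.path.join(wal_dir, old_name), os.path.join(wal_dir, new_name))
    # The process is assumed to be killed here, so the old name is never removed.
```

Running find_shared_wal_links against a copy of the pg_wal directory would list any segment names that point at the same file. An empty result means no two names in that directory share an inode.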
We found this fix [1] that has been applied to Postgres 16, but the
cases we observed were running Postgres 15. Given that older major
versions will be supported for a good number of years, and the
potential for irrecoverability exists (even if rare), we would like to
discuss the possibility of back-patching this fix.
Are there any technical reasons not to back-patch this fix to older
major versions?
Thank you for your consideration.
Sincerely,
Robert Pang
[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=dac1ff3
On Sat, May 7, 2022 at 1:19 AM Michael Paquier <michael(at)paquier(dot)xyz> wrote:
>
> On Thu, May 05, 2022 at 08:10:02PM +0900, Michael Paquier wrote:
> > I'd agree with removing all the callers at the end. pgrename() is
> > quite robust on Windows, but I'd keep the two checks in
> > writeTimeLineHistory(), as the logic around findNewestTimeLine() would
> > consider a past TLI history file as in-use even if we have a crash
> > just after the file got created in the same path by the same standby,
> > and the WAL segment init part. Your patch does that.
>
> As v16 is now open for business, I have revisited this change and
> applied 0001 to change all the callers (aka removal of the assertion
> for the WAL receiver when it overwrites a TLI history file). The
> commit log includes details about the reasoning of all the areas
> changed, for clarity, as of the WAL recycling part, the TLI history
> file part and basic_archive.
> --
> Michael