Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: michael(at)paquier(dot)xyz
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, ethmertz(at)amazon(dot)com, nathandbossart(at)gmail(dot)com, pgsql(at)j-davis(dot)com, sawada(dot)mshk(at)gmail(dot)com
Subject: Re: Incorrect handling of OOM in WAL replay leading to data loss
Date: 2023-08-10 01:00:17
Message-ID: 20230810.100017.2120346592058471531.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Wed, 9 Aug 2023 17:44:49 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in
> > While it's a kind of bug in total, we encountered a case where an
> > excessively large xl_tot_len actually came from a corrupted
> > record. [1]
>
> Right, I remember this one. I think that Thomas was pretty much right
> that this could be caused because of a lack of zeroing in the WAL
> pages.

We have so far treated every kind of broken data as end-of-recovery: an
incorrect rm_id, a bad prev link, and also an excessively large record
length caused by corruption. This patch changes the behavior only for
the last case. If you think broken data can never be non-zero, we
should refuse to proceed with recovery after any non-zero incorrect
data, but that would be quite a big change in our recovery policy.
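
To illustrate, the checks in question look roughly like this (a
simplified sketch loosely modeled on ValidXLogRecordHeader() in
xlogreader.c, not the actual code):

static bool
validate_record_header(XLogReaderState *state, XLogRecPtr RecPtr,
					   XLogRecPtr PrevRecPtr, XLogRecord *record)
{
	/* an implausibly small total length is treated as broken data */
	if (record->xl_tot_len < SizeOfXLogRecord)
	{
		report_invalid_record(state, "invalid record length at %X/%X",
							  LSN_FORMAT_ARGS(RecPtr));
		return false;
	}

	/* an out-of-range resource manager ID is treated the same way */
	if (record->xl_rmid > RM_MAX_ID)
	{
		report_invalid_record(state,
							  "invalid resource manager ID %u at %X/%X",
							  record->xl_rmid, LSN_FORMAT_ARGS(RecPtr));
		return false;
	}

	/* and so is a prev-link that doesn't point at the previous record */
	if (record->xl_prev != PrevRecPtr)
	{
		report_invalid_record(state,
							  "record with incorrect prev-link %X/%X at %X/%X",
							  LSN_FORMAT_ARGS(record->xl_prev),
							  LSN_FORMAT_ARGS(RecPtr));
		return false;
	}

	return true;
}

All three failures look identical to the caller, which is why all of
them have ended recovery the same way up to now; the patch singles out
only the oversized-length case.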

> There are a few options on the table, only doable once the WAL reader
> provides the error state to the startup process:
> 1) Retry a few times and FATAL.
> 2) Just FATAL immediately and don't wait.
> 3) Retry and hope for the best that the host calms down.

4) Wrap up recovery then continue to normal operation.

This is the traditional behavior for corrupt WAL data.

> I have not seen this issue being much of an issue in the field, so
> perhaps option 2 with the structure of 0002 and a FATAL when we catch
> XLOG_READER_OOM in the switch would be enough. At least that's enough
> for the cases we've seen. I'll think a bit more about it, as well.
>
> Yeah, agreed. That's orthogonal to the issue reported by Ethan,
> unfortunately, where he was able to trigger the issue of this thread
> by manipulating the sizing of a host after producing a record larger
> than what the host could afford after the resizing :/

I'm not entirely certain, but if you were to ask me which is more
probable during recovery - encountering a correct record that is too
large for the server to buffer, or stumbling upon a corrupt byte
sequence - I'd bet on the latter.

I'm not sure how often users encounter corrupt WAL data, but I believe
they should have the option to terminate recovery and then switch to
normal operation.

What if we introduced an option to increase the timeline whenever
recovery hits a data error? If the option is disabled, the server stops
when recovery detects incorrect data, except in the case of an OOM,
which simply causes the record read to be retried.
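
To be concrete, the startup side could look something like the
following (just a rough sketch; the GUC name
recovery_continue_on_data_error and the helper function are mine, and
XLOG_READER_OOM is the error state from your 0002 patch):

/* hypothetical GUC; disabled by default */
bool		recovery_continue_on_data_error = false;

/* returns true if the caller should retry reading the same record */
static bool
handle_wal_read_error(int readererr)
{
	switch (readererr)
	{
		case XLOG_READER_OOM:
			/*
			 * A memory shortage is transient and must not be taken as
			 * end-of-WAL; wait a bit, then have the caller retry the
			 * same record.
			 */
			pg_usleep(1000000L);	/* 1 second */
			return true;

		default:
			/*
			 * Broken data.  With the option enabled, the caller wraps
			 * up recovery, increases the timeline and switches to
			 * normal operation, as we have always done.  Otherwise
			 * the server stops here.
			 */
			if (recovery_continue_on_data_error)
				return false;
			ereport(FATAL,
					(errmsg("invalid data found during WAL replay")));
	}
	return false;				/* unreachable */
}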

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
