Re: Incorrect handling of OOM in WAL replay leading to data loss

From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: michael(at)paquier(dot)xyz
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, ethmertz(at)amazon(dot)com, nathandbossart(at)gmail(dot)com, pgsql(at)j-davis(dot)com, sawada(dot)mshk(at)gmail(dot)com
Subject: Re: Incorrect handling of OOM in WAL replay leading to data loss
Date: 2023-08-01 04:51:13
Message-ID: 20230801.135113.1095735354684995020.horikyota.ntt@gmail.com
Lists: pgsql-hackers

At Tue, 1 Aug 2023 12:43:21 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in
> A colleague, Ethan Mertz (in CC), has discovered that we don't handle
> correctly WAL records that are failing because of an OOM when
> allocating their required space. In the case of Ethan, we have bumped
> on the failure after an allocation failure on XLogReadRecordAlloc():
> "out of memory while trying to decode a record of length"

I believe a database server is not supposed to be run in such a
memory-constrained environment.

> In crash recovery, any records after the OOM would not be replayed.
> At quick glance, it seems to me that this can also impact standbys,
> where recovery could stop earlier than it should once a consistent
> point has been reached.

Actually, the code assumes that an OOM at this point happens solely
because the record's length field is broken, and I believe that
assumption was made intentionally.
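
To illustrate what I mean (a rough, paraphrased standalone sketch, not
the actual xlogreader.c code): when the allocation for the decoded
record fails, the failure is reported through the same path that a
corrupted length field would take, so the caller cannot tell a genuine
out-of-memory condition apart from the end of WAL.

/*
 * Rough sketch only, not the PostgreSQL source: both a bogus length
 * and a genuine allocation failure end up in the same error-reporting
 * path, and the caller treats a NULL result as "no more valid
 * records".
 */
#include <stdio.h>
#include <stdlib.h>

static char demo_errormsg[256];     /* stands in for errormsg_buf */

void
demo_report_invalid_record(const char *msg, unsigned long len)
{
    /* single reporting path shared by corruption and OOM */
    snprintf(demo_errormsg, sizeof(demo_errormsg), msg, len);
}

void *
demo_decode_record(unsigned long total_len)
{
    void   *buf = malloc(total_len);

    if (buf == NULL)
    {
        /*
         * Same message as quoted above; to the caller this is
         * indistinguishable from a record whose length field is
         * garbage.
         */
        demo_report_invalid_record(
            "out of memory while trying to decode a record of length %lu",
            total_len);
        return NULL;            /* recovery stops here */
    }
    return buf;
}

int
main(void)
{
    /* a garbage length read from a torn page would look like this */
    if (demo_decode_record((unsigned long) -1) == NULL)
        fprintf(stderr, "stopping replay: %s\n", demo_errormsg);
    return 0;
}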

> A patch is registered in the commit fest to improve the error
> detection handling, but as far as I can see it fails to handle the OOM
> case and replaces ReadRecord() to use a WARNING in the redo loop:
> https://www.postgresql.org/message-id/20200228.160100.2210969269596489579.horikyota.ntt%40gmail.com

That patch doesn't change any behavior unrelated to the case where the
last record is followed by zeroed trailing bytes.

> On top of my mind, any solution I can think of needs to add more
> information to XLogReaderState, where we'd either track the type of
> error that happened close to errormsg_buf which is where these errors
> are tracked, but any of that cannot be backpatched, unfortunately.
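
FWIW, what I imagine that would look like is something along these
lines (the names are made up and the struct is reduced to the relevant
members; this is just a sketch of the idea, not a proposed patch):

/*
 * Hypothetical sketch of the idea quoted above: keep an error kind
 * next to errormsg_buf so that the redo loop can distinguish a real
 * allocation failure from a record that should be treated as the end
 * of WAL.
 */
typedef enum DemoXLogReaderErrorKind
{
    DEMO_XLREAD_NO_ERROR,
    DEMO_XLREAD_INVALID_RECORD, /* bad header/length/CRC: stop as today */
    DEMO_XLREAD_OUT_OF_MEMORY   /* allocation failed: retry or FATAL */
} DemoXLogReaderErrorKind;

typedef struct DemoXLogReaderState
{
    /*
     * The real XLogReaderState has many more members; only the part
     * relevant to error reporting is shown, and errorkind is an
     * assumption made for illustration.
     */
    char        errormsg_buf[1000];
    DemoXLogReaderErrorKind errorkind;  /* hypothetical new field */
} DemoXLogReaderState;

The recovery code would then consult the new field whenever reading
fails and, in the OOM case, retry or bail out with FATAL instead of
assuming it has reached the end of WAL.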

One issue with changing that behavior is that there is no simple way
to detect a broken record before loading it into memory. We might be
able to implement a fallback mechanism, for example one that loads the
record into an already-allocated buffer (smaller than the claimed
length) just to verify whether it is corrupted. However, I question
whether that is worth the additional complexity, and I'm not sure what
we should do if the first allocation failed.
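
If we did try it, the fallback I have in mind would be something like
the following standalone sketch (the function names are made up, a
plain CRC-32 stands in for the WAL CRC, and the actual layout of the
record header and its CRC is glossed over; not a proposed patch):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define SCRATCH_SIZE 8192       /* bounded buffer, allocated once */

/* plain bitwise CRC-32 (polynomial 0xEDB88320), stand-in for pg_crc32c */
uint32_t
crc32_update(uint32_t crc, const unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
    {
        crc ^= buf[i];
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc;
}

/*
 * Stream 'claimed_len' bytes of record data from 'src' through a small
 * fixed buffer and compare the computed CRC with 'expected_crc'.
 * Returns false if the data runs short or the CRC does not match,
 * i.e. the length field was probably garbage and a huge allocation
 * would be pointless.
 */
bool
verify_record_streaming(FILE *src, uint64_t claimed_len,
                        uint32_t expected_crc)
{
    unsigned char scratch[SCRATCH_SIZE];
    uint32_t    crc = 0xFFFFFFFFu;
    uint64_t    remaining = claimed_len;

    while (remaining > 0)
    {
        size_t      want = remaining < SCRATCH_SIZE ?
                           (size_t) remaining : SCRATCH_SIZE;
        size_t      got = fread(scratch, 1, want, src);

        if (got == 0)
            return false;       /* shorter than the header claims */
        crc = crc32_update(crc, scratch, got);
        remaining -= got;
    }
    return (crc ^ 0xFFFFFFFFu) == expected_crc;
}

The reader would run such a check against a scratch buffer it already
owns before deciding whether the oversized allocation is worth
attempting; if the check fails, the record can be reported as invalid
with much more confidence than today.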

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center
