Quick Links

Re: Incorrect handling of OOM in WAL replay leading to data loss

From:	Michael Paquier <michael(at)paquier(dot)xyz>
To:	Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc:	pgsql-hackers(at)lists(dot)postgresql(dot)org, ethmertz(at)amazon(dot)com, nathandbossart(at)gmail(dot)com, pgsql(at)j-davis(dot)com, sawada(dot)mshk(at)gmail(dot)com
Subject:	Re: Incorrect handling of OOM in WAL replay leading to data loss
Date:	2023-08-08 07:29:49
Message-ID:	ZNHu7eY9t8xdFgFG@paquier.xyz
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On Wed, Aug 02, 2023 at 01:16:02PM +0900, Kyotaro Horiguchi wrote:
> I believe this approach is sufficient to determine whether the error
> is OOM or not. If total_len is currupted and has an excessively large
> value, it's highly unlikely that all subsequent pages for that length
> will be consistent.
>
> Do you have any thoughts on this?

This could be more flexible IMO, and actually in some cases
errormsg_fatal may be eaten if using the WAL prefetcher as the error
message is reported with the caller of XLogPrefetcherReadRecord(), no?

Anything that has been discussed on this thread now involves a change
in XLogReaderState that induces an ABI breakage. For HEAD, we are
likely going in this direction, but if we are going to bite the bullet
we'd better be a bit more aggressive with the integration and report
an error code side-by-side with the error message returned by
XLogPrefetcherReadRecord(), XLogReadRecord() and XLogNextRecord() so
as all of the callers can decide what they want to do on an invalid
record or just an OOM.

Attached is the idea of infrastructure I have in mind, as of 0001,
where this adds an error code to report_invalid_record(). For now
this includes three error codes appended to the error messages
generated that can be expanded if need be: no error, OOM and invalid
data. The invalid data part may needs to be much more verbose, and
could be improved to make this stuff "less scary" as the other thread
proposes, but what I have here would be enough to trigger a different
decision in the startup process if a record cannot be fetched on OOM
or if there's a different reason behind that.

0002 is an example of decision that can be taken in WAL replay if we
see an OOM, based on the error code received. One argument is that we
may want to improve emode_for_corrupt_record() so as it reacts better
on OOM, upgrading the emode wanted, but this needs more analysis
depending on the code path involved.

0003 is my previous trick to inject an OOM failure at replay. Reusing
the previous script, this would be enough to prevent an early redo
creating a loss of data.

Note that we have a few other things going in the tree. As one
example, pg_walinspect would consider an OOM as the end of WAL. Not
critical, still slightly incorrect as the end of WAL may not have been
reached yet so it can report some incorrect information depending on
what the WAL reader faces. This could be improved with the additions
of 0001.

Thoughts or comments?
--
Michael

Attachment	Content-Type	Size
v1-0001-Add-infrastructure-to-report-error-codes-in-WAL-r.patch	text/x-diff	27.4 KB
v1-0002-Force-a-FATAL-when-facing-OOM-in-WAL-replay.patch	text/x-diff	1.3 KB
v1-0003-Tweak-to-force-OOM-behavior-when-replaying-record.patch	text/x-diff	1.5 KB

In response to

Re: Incorrect handling of OOM in WAL replay leading to data loss at 2023-08-02 04:16:02 from Kyotaro Horiguchi

Responses

Re: Incorrect handling of OOM in WAL replay leading to data loss at 2023-08-08 08:44:03 from Kyotaro Horiguchi

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	John Naylor	2023-08-08 07:31:42	Re: Doc limitation update proposal: include out-of-line OID usage per TOAST-ed columns
Previous Message	YANG Xudong	2023-08-08 07:22:34	Re: [PATCH] Add loongarch native checksum implementation.