From: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
---|---|
To: | michael(at)paquier(dot)xyz |
Cc: | pgsql-hackers(at)lists(dot)postgresql(dot)org, ethmertz(at)amazon(dot)com, nathandbossart(at)gmail(dot)com, pgsql(at)j-davis(dot)com, sawada(dot)mshk(at)gmail(dot)com |
Subject: | Re: Incorrect handling of OOM in WAL replay leading to data loss |
Date: | 2023-08-01 04:51:13 |
Message-ID: | 20230801.135113.1095735354684995020.horikyota.ntt@gmail.com |
Lists: | pgsql-hackers |
At Tue, 1 Aug 2023 12:43:21 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in
> A colleague, Ethan Mertz (in CC), has discovered that we don't handle
> correctly WAL records that are failing because of an OOM when
> allocating their required space. In the case of Ethan, we have bumped
> on the failure after an allocation failure on XLogReadRecordAlloc():
> "out of memory while trying to decode a record of length"
I believe a database server is not supposed to be run in such a
memory-constrained environment.
> In crash recovery, any records after the OOM would not be replayed.
> At quick glance, it seems to me that this can also impact standbys,
> where recovery could stop earlier than it should once a consistent
> point has been reached.
Actually, the code assumes that an OOM here happens solely because of
a broken record length field. I believe we made that assumption
intentionally.
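To make the ambiguity concrete, here is a minimal sketch of that assumption. It is not the xlogreader code: `read_record`, `alloc_or_null`, `memory_limit` and the `ReadResult` codes are hypothetical names invented for illustration. It only shows that a garbage length field and genuine memory pressure take the same exit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical outcome codes, for illustration only. */
typedef enum { READ_OK, READ_END_OF_WAL } ReadResult;

/* Hypothetical cap standing in for the memory available to the process. */
static size_t memory_limit = 1024 * 1024;

/* Allocator that returns NULL on failure instead of raising an error. */
static void *alloc_or_null(size_t len)
{
    if (len > memory_limit)
        return NULL;
    return malloc(len);
}

/*
 * Try to make room for a record whose header claims total_len bytes.
 * A failed allocation is reported as end-of-WAL, on the assumption that
 * only a garbage length field can ask for that much memory.
 */
static ReadResult read_record(uint32_t total_len)
{
    void *buf;

    assert(total_len > 0);      /* this sketch never passes an empty record */
    buf = alloc_or_null(total_len);
    if (buf == NULL)
        return READ_END_OF_WAL; /* corruption and real OOM look identical */
    free(buf);
    return READ_OK;
}
```

With the default limit, `read_record(512)` succeeds and `read_record(0xFFFFFFF0u)` is treated as end-of-WAL. After lowering `memory_limit` to 256, the same valid 512-byte request is also treated as end-of-WAL, which is the case reported upthread.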
> A patch is registered in the commit fest to improve the error
> detection handling, but as far as I can see it fails to handle the OOM
> case and replaces ReadRecord() to use a WARNING in the redo loop:
> https://www.postgresql.org/message-id/20200228.160100.2210969269596489579.horikyota.ntt%40gmail.com
That patch changes behavior only for the case where the last record
is followed by zeroed trailing bytes; nothing else is affected.
> On top of my mind, any solution I can think of needs to add more
> information to XLogReaderState, where we'd either track the type of
> error that happened close to errormsg_buf which is where these errors
> are tracked, but any of that cannot be backpatched, unfortunately.
One issue with changing that behavior is that there is no simple way
to detect a broken record before loading it into memory. We might be
able to implement a fallback mechanism, for example one that loads the
record into an already-allocated buffer (which is smaller than the
specified length) just to verify whether it is corrupted. However, I
question whether that is worth the additional complexity. And I'm not
sure what should happen if the first allocation failed.
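A rough sketch of what such a fallback could look like, under heavy assumptions: the record layout (`RecHeader`), `verify_streaming` and `crc32c_update` are hypothetical and much simpler than the real XLogRecord handling, and the CRC here covers only the payload. The idea is to stream the record through a small scratch buffer that was allocated up front, so the check itself needs no new memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified layout; the real XLogRecord header has more fields. */
typedef struct {
    uint32_t total_len;  /* header + payload, in bytes */
    uint32_t crc;        /* CRC-32C of the payload only (a simplification) */
} RecHeader;

typedef enum { REC_VALID, REC_CORRUPT } RecCheck;

/* Bitwise CRC-32C (reflected polynomial 0x82F63B78); slow but dependency-free. */
static uint32_t crc32c_update(uint32_t crc, const uint8_t *p, size_t n)
{
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ ((crc & 1u) ? 0x82F63B78u : 0u);
    }
    return crc;
}

/*
 * Verify the record at wal[off] without allocating total_len bytes:
 * the payload is streamed through a small, pre-allocated scratch buffer.
 */
static RecCheck verify_streaming(const uint8_t *wal, size_t wal_len, size_t off,
                                 uint8_t *scratch, size_t scratch_len)
{
    RecHeader h;

    assert(scratch != NULL && scratch_len > 0);

    if (off > wal_len || wal_len - off < sizeof(h))
        return REC_CORRUPT;
    memcpy(&h, wal + off, sizeof(h));

    /* A length that is too small or runs past the available data is garbage. */
    if (h.total_len < sizeof(h) || h.total_len > wal_len - off)
        return REC_CORRUPT;

    size_t remaining = h.total_len - sizeof(h);
    const uint8_t *src = wal + off + sizeof(h);
    uint32_t crc = 0xFFFFFFFFu;

    while (remaining > 0) {
        size_t chunk = remaining < scratch_len ? remaining : scratch_len;

        memcpy(scratch, src, chunk);  /* stands in for reading WAL from disk */
        crc = crc32c_update(crc, scratch, chunk);
        src += chunk;
        remaining -= chunk;
    }
    return (crc ^ 0xFFFFFFFFu) == h.crc ? REC_VALID : REC_CORRUPT;
}
```

If the large allocation fails and this check returns `REC_VALID`, the failure would be a genuine OOM rather than a broken length field; `REC_CORRUPT` would keep today's end-of-WAL interpretation. The sketch sidesteps the last question above by requiring the scratch buffer to exist before any record is read.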
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center