From: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
To: michael(at)paquier(dot)xyz
Cc: pgsql-hackers(at)lists(dot)postgresql(dot)org, ethmertz(at)amazon(dot)com, nathandbossart(at)gmail(dot)com, pgsql(at)j-davis(dot)com, sawada(dot)mshk(at)gmail(dot)com
Subject: Re: Incorrect handling of OOM in WAL replay leading to data loss
Date: 2023-08-09 08:00:49
Message-ID: 20230809.170049.2032567705309253841.horikyota.ntt@gmail.com
Lists: pgsql-hackers
At Wed, 9 Aug 2023 16:35:09 +0900, Michael Paquier <michael(at)paquier(dot)xyz> wrote in
> Or perhaps just XLOG_READER_NO_ERROR?
Looks fine.
> > 0002 shifts the behavior for the OOM case from ending recovery to
> > retrying at the same record. If the last record is really corrupted,
> > the server won't be able to finish recovery. I doubt we are good with
> > this behavior change.
>
> You mean on an incorrect xl_tot_len? Yes, that could be possible.
> Another possibility would be a retry logic with a hardcoded number of
> attempts and a delay between each. Once the infrastructure is in
> place, this still deserves more discussion, but we can be flexible.
> The immediate FATAL is a choice.
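
To make that retry idea concrete, here is a minimal sketch of what a
bounded retry loop could look like (the attempt cap, the delay, and the
errorcode field follow the patch under discussion and are my
assumptions, not committed API):

    /* Hypothetical values; the real numbers would need discussion. */
    #define MAX_OOM_RETRIES 3
    #define OOM_RETRY_MS    1000

    XLogRecord *record = NULL;
    char       *errormsg = NULL;

    for (int attempt = 0; attempt <= MAX_OOM_RETRIES; attempt++)
    {
        record = XLogReadRecord(xlogreader, &errormsg);
        if (record != NULL)
            break;              /* record decoded successfully */
        if (xlogreader->errorcode != XLOG_READER_OOM)
            break;              /* end of WAL, or a broken record */
        /* Allocation failed; wait a bit and retry the same record. */
        pg_usleep(OOM_RETRY_MS * 1000L);
    }

    if (record == NULL && xlogreader->errorcode == XLOG_READER_OOM)
        ereport(FATAL,
                (errmsg("out of memory while reading WAL record")));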
While it is ultimately a kind of bug, we have encountered a case where
an excessively large xl_tot_len actually came from a corrupted
record. [1]

I'm glad to see this infrastructure coming in, and I'm on board with
retrying after an OOM. However, I think we really need an official way
to wrap up recovery when there is a truly broken, oversized
xl_tot_len.
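
For reference, the failure mode in [1] is that a garbage xl_tot_len
makes the reader request a huge allocation, and the resulting NULL is
indistinguishable from a genuine out-of-memory condition. Roughly (an
illustrative sketch, not the reader's actual code path):

    uint32      total_len = record->xl_tot_len; /* garbage, e.g. ~2GB */
    char       *buf;

    /* With MCXT_ALLOC_NO_OOM, palloc_extended() returns NULL instead
     * of erroring out when the allocation fails. */
    buf = palloc_extended(total_len, MCXT_ALLOC_NO_OOM);
    if (buf == NULL)
    {
        /* Genuine OOM, or a corrupted length?  The reader cannot tell,
         * which is why recovery needs an official way to end here
         * instead of retrying forever. */
    }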
[1] https://www.postgresql.org/message-id/17928-aa92416a70ff44a2@postgresql.org
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center