Re: Infinite loop in XLogPageRead() on standby

From: Ants Aasma <ants(dot)aasma(at)cybertec(dot)at>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: cyberdemn(at)gmail(dot)com, michael(at)paquier(dot)xyz, pgsql-hackers(at)postgresql(dot)org, thomas(dot)munro(at)gmail(dot)com
Subject: Re: Infinite loop in XLogPageRead() on standby
Date: 2024-03-15 10:43:41
Message-ID: CANwKhkM_DzL2cMtDXN-qTuMSyx9Pyo4tBADFPJTVN65jkPoGUQ@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, 13 Mar 2024 at 04:56, Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote:
>
> At Mon, 11 Mar 2024 16:43:32 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> > Oh, I once saw the fix work, but seems not to be working after some
> > point. The new issue was a corruption of received WAL records on the
> > first standby, and it may be related to the setting.
>
> I identified the cause of the second issue. When I tried to replay the
> issue, the second standby accidentally received the old timeline's
> last page-spanning record till the end while the first standby was
> promoting (but it had not been read by recovery). In addition to that,
> on the second standby, there's a time window where the timeline
> increased but the first segment of the new timeline is not available
> yet. In this case, the second standby successfully reads the
> page-spanning record in the old timeline even after the second standby
> noticed that the timeline ID has been increased, thanks to the
> robustness of XLogFileReadAnyTLI().
>
> I think the primary change to XLogPageRead that I suggested is correct
> (assuming the use of wal_segment_size instead of the
> constant). However, still XLogFileReadAnyTLI() has a chance to read
> the segment from the old timeline after the second standby notices a
> timeline switch, leading to the second issue. The second issue was
> fixed by preventing XLogFileReadAnyTLI from reading segments from
> older timelines than those suggested by the latest timeline
> history. (In other words, disabling the "AnyTLI" part).
>
> I recall that there was a discussion for commit 4bd0ad9e44, about the
> objective of allowing reading segments from older timelines than the
> timeline history suggests. In my faint memory, we concluded to
> postpone making the decision to remove the feature due to uncertainity
> about the objective. If there's no clear reason to continue using
> XLogFileReadAnyTLI(), I suggest we stop its use and instead adopt
> XLogFileReadOnTLHistory(), which reads segments that align precisely
> with the timeline history.

This sounds very similar to the problem described in [1]. And I think
both will be resolved by that change.

[1] https://postgr.es/m/CANwKhkMN3QwAcvuDZHb6wsvLRtkweBiYso-KLFykkQVWuQLcOw%40mail.gmail.com

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Teodor Sigaev 2024-03-15 10:57:13 Re: type cache cleanup improvements
Previous Message jian he 2024-03-15 10:30:18 Re: remaining sql/json patches