Re: PITR promote bug: Checkpointer writes to older timeline

From: Michael Paquier <michael(at)paquier(dot)xyz>
To: Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com>
Cc: masao(dot)fujii(at)oss(dot)nttdata(dot)com, soumyadeep2007(at)gmail(dot)com, hlinnaka(at)iki(dot)fi, pgsql-hackers(at)postgresql(dot)org, jyih(at)vmware(dot)com, kyeap(at)vmware(dot)com
Subject: Re: PITR promote bug: Checkpointer writes to older timeline
Date: 2021-03-18 03:56:12
Message-ID: YFLPXIp5rE45so51@paquier.xyz
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

On Wed, Mar 17, 2021 at 05:09:50PM +0900, Michael Paquier wrote:
> Currently with HEAD and back branches, nothing would be broken as
> logical contexts cannot exist in recovery. Still it would be easy
> to miss the new behavior for anybody attempting to work more on this
> feature in the future if we change read_local_xlog_page to not update
> ThisTimeLineID anymore. Resetting the value of ThisTimeLineID in
> read_local_xlog_page() does not seem completely right either with this
> argument, as they could be some custom code relying on the existing
> behavior of read_local_xlog_page() to maintain ThisTimeLineID.

I was looking at uses of ThisTimeLineID in the wild, and could not
find it getting checked or used actually in backend-side code that
involved the WAL reader facility. Even if it brings confidence, it
does not mean that it is not used somewhere :/

> Hmmm. I am wondering whether the best answer for the moment would not
> be to save and reset ThisTimeLineID just in XlogReadTwoPhaseData(), as
> that's the local change that uses read_local_xlog_page().

And attached is the patch able to achieve that. At least it is
simple, and does not break the actual assumptions this callback relies
on. This is rather weak though if there are errors as this is out of
critical sections, still the disease is worse. I have double-checked
all the existing backend code that uses XLogReadRecord(), and did not
notice any code paths with issues similar to this one.

> The state of the code is really confusing on HEAD, and I'd like to
> think that the best thing we could do in the long-term is to make the
> logical decoding path not rely on ThisTimeLineID at all and decouple
> all that, putting the code in a state sane enough so as we don't
> finish with similar errors as what is discussed on this thread. That
> would be a work for a different patch though, not for stable
> branches. And seeing some slot and advancing functions update
> directly ThisTimeLineID is no good either..

However, I'd like to think that we should completely untie the
dependency to ThisTimeLineID in any page read callbacks in core in the
long-term, and potentially clean up any assumptions behind timeline
jumps while in recovery for logical contexts as that cannot happen.
At this stage of the 14 dev cycle, that would be material for 15~, but
I also got to wonder if there is work going on to support logical
decoding on standbys, in particular if this would really rely on
ThisTimeLineID.

Thoughts are welcome.
--
Michael

Attachment Content-Type Size
fix-pitr-2pc-v4.patch text/x-diff 5.1 KB

In response to

Responses

Browse pgsql-hackers by date

  From Date Subject
Next Message Mark Dilger 2021-03-18 04:00:39 Re: pg_amcheck contrib application
Previous Message tsunakawa.takay@fujitsu.com 2021-03-18 03:38:09 RE: libpq debug log