From: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
---|---|
To: | robertmhaas(at)gmail(dot)com |
Cc: | dilipbalaut(at)gmail(dot)com, pgsql-hackers(at)lists(dot)postgresql(dot)org |
Subject: | Re: Race condition in recovery? |
Date: | 2021-05-14 05:24:30 |
Message-ID: | 20210514.142430.450479701172804252.horikyota.ntt@gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-hackers |
At Fri, 14 May 2021 14:12:31 +0900 (JST), Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> wrote in
> At Thu, 13 May 2021 17:07:31 -0400, Robert Haas <robertmhaas(at)gmail(dot)com> wrote in
> > On Mon, May 10, 2021 at 4:35 AM Kyotaro Horiguchi
> > <horikyota(dot)ntt(at)gmail(dot)com> wrote:
> > > It seems to me the issue here is not a race condition but
> > > WaitForWALToBecomeAvailable initializing expectedTLEs with the history
> > > of a improper timeline. So using recoveryTargetTLI instead of
> > > receiveTLI for the case fixes this issue.
> >
> > I agree.
> >
> > > I believe the 004_timeline_switch.pl detects your issue. And the
> > > attached change fixes it.
> >
> > So why does this use recoveryTargetTLI instead of receiveTLI only
> > conditionally? Why not do it all the time?
>
> The commit ee994272ca apparently says that there's a case where primary
This is not an incomplete line but just a garbage.
> > The hard thing about this code is that the assumptions are not very
> > clear. If we don't know why something is a certain way, then we might
> > break things if we change it. Worse yet, if nobody else knows why it's
> > like that either, then who knows what assumptions they might be
> > making? It's hard to be sure that any change is safe.
>
> Thanks for the comment.
>
> > But that being said, we have a clear definition from the comments for
> > what expectedTLEs is supposed to contain, and it's only going to end
> > up with those contents if we initialize it from recoveryTargetTLI. So
> > I am inclined to think that we ought to do that always, and if it
>
> Yes, I also found it after that, and agreed. Desynchronization
> between recoveryTargetTLI and expectedTLEs prevents
> rescanLatestTimeline from working.
>
> > breaks something, then that's a sign that some other part of the code
> > also needs fixing, because apparently that hypothetical other part of
> > the code doesn't work if expctedTLEs contains what the comments say
> > that it should.
>
> After some more inspection, I'm now also sure that it is a
> typo/thinko. Other than while fetching the first checkpoint,
> receivedTLI is always in the history of recoveryTargetTLI, otherwise
> recoveryTargetTLI is equal to receiveTLI.
>
> I read that the commit message as "waiting for fetching possible
> future history files to know if there's the future for the current
> timeline. However now I read it as "don't bother expecting for
> possiblly-unavailable history files when we're reading the first
> checkpoint the timeline for which is already known to us.". If it is
> correct we don't bother considering future history files coming from
> primary there.
>
> > Now maybe that's the wrong idea. But if so, then we're saying that the
> > definition of expectedTLEs needs to be changed, and we should update
> > the comments with the new definition, whatever it is. A lot of the
> > confusion here results from the fact that the code and comments are
> > inconsistent and we can't tell whether that's intentional or
> > inadvertent. Let's not leave the next person who looks at this code
> > wondering the same thing about whatever changes we make.
>
> Ok. The reason why we haven't have a complain about that would be
> that it is rare that pg_wal is wiped out before a standby connects to
> a just-promoted primary. I'm not sure about the tool Dilip is using,
> though..
>
> So the result is the attached. This would be back-patcheable to 9.3
> (or 9.6?) but I doubt that we should do as we don't seem to have had a
> complaint on this issue and we're not full faith on this.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
From | Date | Subject | |
---|---|---|---|
Next Message | Dilip Kumar | 2021-05-14 05:27:34 | Re: OOM in spgist insert |
Previous Message | Dilip Kumar | 2021-05-14 05:13:20 | Re: parallel vacuum - few questions on docs, comments and code |