From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Race condition in recovery? |
Date: | 2021-05-19 12:16:05 |
Message-ID: | CAFiTN-tJ8gKs0+f7wsybdd3dUX73ZxiSEKN9vjso2=GnhgTJjw@mail.gmail.com |
Lists: | pgsql-hackers |
On Tue, May 18, 2021 at 12:22 PM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
> And finally I think I could reach the situation the commit wanted to fix.
>
> I took a basebackup from a standby just before replaying the first
> checkpoint of the new timeline (by using debugger), without copying
> pg_wal. In this backup, the control file contains checkPointCopy of
> the previous timeline.
>
> I modified StartXLOG so that expectedTLEs is set just after first
> determining recoveryTargetTLI, then started the grandchild node. I
> have the following error and the server fails to continue replication.
> [postmaster] LOG: starting PostgreSQL 14beta1 on x86_64-pc-linux-gnu...
> [startup] LOG: database system was interrupted while in recovery at log...
> [startup] LOG: set expectedtles tli=6, length=1
> [startup] LOG: Probing history file for TLI=7
> [startup] LOG: entering standby mode
> [startup] LOG: scanning segment 3 TLI 6, source 0
> [startup] LOG: Trying fetching history file for TLI=6
> [walreceiver] LOG: fetching timeline history file for timeline 5 from pri...
> [walreceiver] LOG: fetching timeline history file for timeline 6 from pri...
> [walreceiver] LOG: started streaming ... primary at 0/3000000 on timeline 5
> [walreceiver] DETAIL: End of WAL reached on timeline 5 at 0/30006E0.
> [startup] LOG: unexpected timeline ID 1 in log segment 000000050000000000000003, offset 0
> [startup] LOG: Probing history file for TLI=7
> [startup] LOG: scanning segment 3 TLI 6, source 0
> (repeats forever)

So IIUC, these logs show that
"ControlFile->checkPointCopy.ThisTimeLineID" is 6 but the
"ControlFile->checkPoint" record is on TL 5? I think if you had the
old version of the code (before the commit), or the code below [1] right
after initializing expectedTLEs, then you would have hit the FATAL that
the patch fixed.

While debugging, did you check the "ControlFile->checkPoint" LSN
against the first LSN of the first segment on TL 6?

    expectedTLEs = readTimeLineHistory(recoveryTargetTLI);

[1]
    if (tliOfPointInHistory(ControlFile->checkPoint, expectedTLEs) !=
        ControlFile->checkPointCopy.ThisTimeLineID)
    {
        report(FATAL..
    }
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com