From: | Dilip Kumar <dilipbalaut(at)gmail(dot)com> |
---|---|
To: | Kyotaro Horiguchi <horikyota(dot)ntt(at)gmail(dot)com> |
Cc: | Robert Haas <robertmhaas(at)gmail(dot)com>, PostgreSQL Hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org> |
Subject: | Re: Race condition in recovery? |
Date: | 2021-05-07 05:34:53 |
Message-ID: | CAFiTN-spAMc6WsobbphZDDz+QuwNOmWfTeR6d2BX3W=_NMmP9g@mail.gmail.com |
Lists: | pgsql-hackers |
On Fri, May 7, 2021 at 8:23 AM Kyotaro Horiguchi
<horikyota(dot)ntt(at)gmail(dot)com> wrote:
>
> At Tue, 4 May 2021 17:41:06 +0530, Dilip Kumar <dilipbalaut(at)gmail(dot)com> wrote in
> Could you please fix the test script so that it causes your issue
> correctly? And/or elaborate a bit more?
>
> The attached first file is the debugging aid logging. The second is
> the test script, to be placed in src/test/recovery/t.
I will look into your test case and see whether it reproduces the
issue. But first, let me summarise exactly what the issue is.
Basically, the issue is this. First, in validateRecoveryParameters, if
the recovery target is the latest timeline, we fetch the latest history
file and set recoveryTargetTLI to the latest available timeline (assume
it is 2), but we delay updating expectedTLEs (as per commit
ee994272ca50f70b53074f0febaec97e28f83c4e). Then, while reading the
checkpoint record, if we don't get the required WAL from the archive,
we try to get it from the primary, and to fetch the checkpoint from the
primary we use "ControlFile->checkPointCopy.ThisTimeLineID" (suppose
that is the older timeline 1). After reading the checkpoint, we set
expectedTLEs based on the timeline from which we got the checkpoint
record.
See the logic below in WaitForWALToBecomeAvailable:
if (readFile < 0)
{
    if (!expectedTLEs)
        expectedTLEs = readTimeLineHistory(receiveTLI);
Now, the first problem is that we are breaking the sanity of
expectedTLEs: by definition it should start with recoveryTargetTLI,
but here it starts with the older TLI. Then, in rescanLatestTimeLine,
we try to find the latest TLI, which is still 2, so the function
returns without reinitializing expectedTLEs, because it assumes that
if recoveryTargetTLI already points to 2 then expectedTLEs must be
correct and need not be changed.
See the logic below:
rescanLatestTimeLine(void)
{
    ....
    newtarget = findNewestTimeLine(recoveryTargetTLI);
    if (newtarget == recoveryTargetTLI)
    {
        /* No new timelines found */
        return false;
    }
    ...
    newExpectedTLEs = readTimeLineHistory(newtarget);
    ...
    expectedTLEs = newExpectedTLEs;
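To make the sequence easier to follow, here is a small standalone model of
the three steps described above. It is only a sketch, not PostgreSQL code:
RecoveryState, set_latest_target, stream_checkpoint, rescan_latest and
simulate_race are hypothetical names, and the expectedTLEs list is reduced
to "is it set" plus the timeline it starts with.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int TimeLineID;

/* Toy model of the relevant recovery state (hypothetical, not xlog.c). */
typedef struct RecoveryState
{
    TimeLineID  recoveryTargetTLI;  /* timeline chosen as recovery target */
    TimeLineID  checkpointTLI;      /* models checkPointCopy.ThisTimeLineID */
    TimeLineID  newestTLI;          /* newest timeline with a history file */
    bool        haveExpectedTLEs;   /* models expectedTLEs != NULL */
    TimeLineID  expectedHeadTLI;    /* timeline expectedTLEs starts with */
} RecoveryState;

/* Step 1: target "latest" picks the newest TLI, expectedTLEs stays unset. */
static void
set_latest_target(RecoveryState *st)
{
    assert(st != NULL);
    st->recoveryTargetTLI = st->newestTLI;
}

/* Step 2: checkpoint is streamed from the primary on the checkpoint's TLI. */
static void
stream_checkpoint(RecoveryState *st)
{
    TimeLineID  receiveTLI = st->checkpointTLI;

    if (!st->haveExpectedTLEs)
    {
        /* stands in for expectedTLEs = readTimeLineHistory(receiveTLI) */
        st->expectedHeadTLI = receiveTLI;
        st->haveExpectedTLEs = true;
    }
}

/* Step 3: rescan only rebuilds expectedTLEs if a newer target shows up. */
static bool
rescan_latest(RecoveryState *st)
{
    if (st->newestTLI == st->recoveryTargetTLI)
        return false;           /* "No new timelines found" */

    st->recoveryTargetTLI = st->newestTLI;
    st->expectedHeadTLI = st->newestTLI;
    st->haveExpectedTLEs = true;
    return true;
}

/* Run the scenario from this mail: checkpoint on TLI 1, newest TLI is 2. */
static RecoveryState
simulate_race(void)
{
    RecoveryState st = {0};

    st.checkpointTLI = 1;
    st.newestTLI = 2;
    set_latest_target(&st);
    stream_checkpoint(&st);
    (void) rescan_latest(&st);  /* returns false, leaves the list alone */
    return st;
}
```

In this model, simulate_race() ends with recoveryTargetTLI at 2 while the
list still starts with timeline 1, and a further rescan_latest() call does
not repair it, which is the inconsistency described above.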
Possible solutions:
1. Find a better way to fix the problem that commit
ee994272ca50f70b53074f0febaec97e28f83c4e addressed, since its current
fix breaks the sanity of expectedTLEs.
2. Assume we have to live with that commit's fix, and that we have to
initialize expectedTLEs with an older timeline in order to validate the
checkpoint in the absence of the timeline history file (as the commit
claims). Then, as soon as we read and validate the checkpoint, fix
expectedTLEs and set it based on the history file of recoveryTargetTLI.
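As a rough illustration of option 2, the resync step could look something
like the sketch below. This is a toy model under my own assumptions, not a
patch: TLEState and resync_expected_tles are hypothetical names, and the
assignment stands in for re-reading the history file of recoveryTargetTLI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned int TimeLineID;

/* Toy state (hypothetical): only what the resync decision needs. */
typedef struct TLEState
{
    TimeLineID  recoveryTargetTLI;   /* timeline chosen as recovery target */
    TimeLineID  expectedHeadTLI;     /* timeline expectedTLEs starts with */
    bool        checkpointValidated; /* checkpoint record read and checked */
} TLEState;

/*
 * Once the checkpoint has been validated, make expectedTLEs start with
 * recoveryTargetTLI again.  Returns true if the list had to be rebuilt.
 */
static bool
resync_expected_tles(TLEState *st)
{
    assert(st != NULL);

    if (!st->checkpointValidated)
        return false;           /* too early, the old list is still needed */
    if (st->expectedHeadTLI == st->recoveryTargetTLI)
        return false;           /* already consistent */

    /* stands in for expectedTLEs = readTimeLineHistory(recoveryTargetTLI) */
    st->expectedHeadTLI = st->recoveryTargetTLI;
    return true;
}
```

The point of the ordering is that the older timeline is still used to
validate the checkpoint, and the list is corrected only afterwards.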
Does this explanation make sense? If not, please let me know which
part is unclear so I can point to the relevant code.
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com