Timeline issue if StartupXLOG() is interrupted right before end-of-recovery record is done

From: Roman Eskin <r(dot)eskin(at)arenadata(dot)io>
To: pgsql-hackers(at)lists(dot)postgresql(dot)org
Subject: Timeline issue if StartupXLOG() is interrupted right before end-of-recovery record is done
Date: 2025-01-17 01:05:53
Message-ID: f7c8204e-ba32-4bf0-aef3-f0f9e55a3347@arenadata.io
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-hackers

Hello pgsql-hackers,

It appears to exist an issue with timeline switching in archive
recovery mode if the startup process is interrupted by some external
reason (for ex. system power loss) in a specific tight time window
(after a new history file is created but before end-of-recovery record
is done). If this happens, and the Postgres instance is launched
again, it carries out the common crash recovery, starts successfully
to 'in production' state, but continues the previous timeline. Thus,
we are having inconsistent WAL state, where we are writing WAL segment
files to old timeline and having presented new orphaned timeline in
the history file.

This issue can reveal itself in streaming replication scenario, for
ex.:
1. Assume we have instance_1 (Master) and instance_2 (Replica);
2. instance_1 (Master) fails for any reason, instance_2 (Replica)
starts to promote. But while instance_2 is still in `StartupXLOG()`
somewhere between `writeTimeLineHistory()` and
`CreateEndOfRecoveryRecord()`, we are getting, for example, power
loss on instance_2 server.
3. Once instance_2 server is up again, instance_2 is restarted. The
problem with timeline already exists here, but very likely to be
unnoticed, as instance_2 (now Master) is accepting connections and
working.
4. Bring back instance_1 as Replica.
5. Switch back instance_1 and instance_2 to the original
configuration. And here, when we try to start instance_2 as Replica,
we'll get a FATAL:
"FATAL: requested timeline 2 is not a child of this server's history
DETAIL: Latest checkpoint is at 0/303FF90 on timeline 1, but in the
history of the requested timeline, the server forked off from that
timeline at 0/3023538."

I agree that one have to be very unlucky to get into such situation
with 2 consequent failures and to hit the specific time window, but if
this happens, it will look quite severe and hard to debug. We faced
this situation on our synthetic tests.

To reproduce the issue on latest code from Postgres master branch in a
stable manner, I used the attached
'test-code-for-timeline-issue.patch'. It postpones execution in
`StartupXLOG()` if we need to hit the requested time window. Detailed
steps to reproduce I've used with the patch are in the attached file
'steps - NG scenario.txt'.

Possible way to fix this issue, that we now consider, is to postpone
the removal of signal files ("recovery.signal" and "standby.signal")
until the end-of-recovery record is done, and thus retry interrupted
promotion in case previous one failed. Looks like, the only
insignificant negative outcome for this approach is possible
assignment of one more timeline identity. But we would like to get
more feedback from the community. Do you foresee any possible
drawbacks for this approach? Please advise.

Many thanks in advance!

P.S. The topic above is a bit similar to this old one -
https://www.postgresql.org/message-id/flat/CABUevEz09XY2EevA2dLjPCY-C5UO4Hq%3DXxmXLmF6ipNFecbShQ%40mail.gmail.com.

Best regards,
Roman Eskin

Attachment Content-Type Size
steps - NG scenario.txt text/plain 3.3 KB
test-code-for-timeline-issue.patch text/plain 702 bytes

Browse pgsql-hackers by date

  From Date Subject
Next Message Masahiko Sawada 2025-01-17 01:06:14 Re: Parallel heap vacuum
Previous Message Noah Misch 2025-01-17 00:52:21 Re: An improvement of ProcessTwoPhaseBuffer logic