Phantom segment upon promotion causing troubles.

From: Andres Freund <andres(at)anarazel(dot)de>
To: pgsql-hackers(at)postgresql(dot)org, Heikki Linnakangas <hlinnaka(at)iki(dot)fi>, gregburek(at)heroku(dot)com
Subject: Phantom segment upon promotion causing troubles.
Date: 2017-06-19 07:30:26
Message-ID: 20170619073026.zcwpe6mydsaz5ygd@alap3.anarazel.de
Lists: pgsql-hackers

Hi,

Greg Burek from Heroku (CCed) reported an issue on IM that was weird
enough to be interesting. What he'd observed was that he promoted some
PITR standby, and early clones of that node worked, but later clones
did not, failing to read some segment.

The problem turns out to be the following: When a node is promoted at
a segment boundary, just after an XLOG_SWITCH record, we'll have
    EndOfLog = EndRecPtr;
pointing to the *beginning* of the next segment, as XLOG_SWITCH records
are treated as using the whole segment. After creating the
END_OF_RECOVERY record (or checkpoint), we'll do:

    if (ArchiveRecoveryRequested)
    {
        /*
         * We switched to a new timeline. Clean up segments on the old
         * timeline.
         *
         * If there are any higher-numbered segments on the old timeline,
         * remove them. They might contain valid WAL, but they might also be
         * pre-allocated files containing garbage. In any case, they are not
         * part of the new timeline's history so we don't need them.
         */
        RemoveNonParentXlogFiles(EndOfLog, ThisTimeLineID);

Note that this uses EndOfLog, pointing to ab/cd000000 (i.e. the
beginning of a segment). RemoveNonParentXlogFiles() in turn uses
RemoveXlogFile() to remove superfluous files. That's where the fun
begins.

static void
RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
{
    XLogSegNo   endlogSegNo;
    XLogSegNo   recycleSegNo;
...
#define XLByteToPrevSeg(xlrp, logSegNo) \
    logSegNo = ((xlrp) - 1) / XLogSegSize
...
    XLByteToPrevSeg(endptr, endlogSegNo);
    if (PriorRedoPtr == InvalidXLogRecPtr)
        recycleSegNo = endlogSegNo + 10;
    else
        recycleSegNo = XLOGfileslop(PriorRedoPtr);
...
        InstallXLogFileSegment(&endlogSegNo, path,
                               true, recycleSegNo, true))
...

So what happens here is that we're calling InstallXLogFileSegment() to
remove superfluous xlog files (e.g. because they're before the recovery
target, because restore command ran before the trigger file was detected
or because walsender received them). But because endptr = ab/cd000000,
the use of XLByteToPrevSeg() means InstallXLogFileSegment() will be
called with the *previous* segment's segment number.
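
To make the off-by-one concrete, here is a minimal standalone sketch of
the arithmetic behind the two macros. It assumes the default 16MB
segment size; seg_of() and prev_seg_of() are names made up for this
illustration, mirroring XLByteToSeg() and XLByteToPrevSeg().

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: default 16MB WAL segments. */
#define SKETCH_SEG_SIZE ((uint64_t) 16 * 1024 * 1024)

/* Mirrors XLByteToSeg(): the segment containing byte position ptr. */
static uint64_t
seg_of(uint64_t ptr)
{
    return ptr / SKETCH_SEG_SIZE;
}

/* Mirrors XLByteToPrevSeg(): the segment containing the byte before ptr. */
static uint64_t
prev_seg_of(uint64_t ptr)
{
    assert(ptr > 0);            /* ptr - 1 would wrap around for 0 */
    return (ptr - 1) / SKETCH_SEG_SIZE;
}
```

For any pointer inside a segment both functions agree. They differ
only when the pointer sits exactly on a segment boundary, such as
ab/cd000000, where prev_seg_of() yields one segment less.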

That in turn will lead to InstallXLogFileSegment() installing the
to-be-removed segment into the current timeline, but at a segment
number from *before* the creation of the new timeline, for the purpose
of recycling the segment. I'll call this the "phantom" segment: it has
no meaningful content and lives at a position where the new timeline
does not yet exist.
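
As an illustration of what the phantom looks like on disk, the sketch
below builds WAL file names the way the "%08X%08X%08X" naming scheme
does. The timeline numbers (old timeline 1, new timeline 2) and the
16MB segment size are assumptions for the example; wal_file_name() is
a made-up helper, not a PostgreSQL function.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: 16MB segments, so 4GB / 16MB = 256 segments per "xlogid". */
#define SKETCH_SEGS_PER_XLOGID 256

/* Format timeline + segment number as a 24-character WAL file name. */
static void
wal_file_name(char *buf, size_t buflen, uint32_t tli, uint64_t segno)
{
    assert(buflen >= 25);       /* 24 hex digits plus terminating NUL */
    snprintf(buf, buflen, "%08X%08X%08X",
             (unsigned) tli,
             (unsigned) (segno / SKETCH_SEGS_PER_XLOGID),
             (unsigned) (segno % SKETCH_SEGS_PER_XLOGID));
}
```

With EndOfLog at ab/cd000000, the last real segment of the old
timeline would be 00000001000000AB000000CC, and the phantom would be
00000002000000AB000000CC: same segment number, newer timeline.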

As there's no .ready file created for that segment, and we'll never
actually write to it, it'll initially just sit around. Not visible for
archiving, and normally unused by WAL streaming. But that changes at
later checkpoints, because RemoveOldXlogFiles() calls
XLogArchiveCheckDone(), which does the following:
/*
 * XLogArchiveCheckDone
 *
...
 * If <XLOG>.done exists, then return true; else if <XLOG>.ready exists,
 * then return false; else create <XLOG>.ready and return false.
 *
 * The reason we do things this way is so that if the original attempt to
 * create <XLOG>.ready fails, we'll retry during subsequent checkpoints.

So we'll at some later point create a .ready for the above created
phantom segment. Which then will get archived.
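
The rule in that comment can be sketched as a tiny state machine. This
models only the part quoted above (the elided "..." is not covered),
and SegStatus and archive_check_done() are hypothetical stand-ins for
the archive_status files and the real function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical in-memory stand-in for a segment's archive_status files. */
typedef struct SegStatus
{
    bool        has_done;       /* <XLOG>.done exists */
    bool        has_ready;      /* <XLOG>.ready exists */
} SegStatus;

/*
 * Follows the quoted comment: true only if .done exists; otherwise make
 * sure .ready exists and return false.
 */
static bool
archive_check_done(SegStatus *st)
{
    assert(st != NULL);
    if (st->has_done)
        return true;
    if (st->has_ready)
        return false;
    st->has_ready = true;       /* "retry" of the supposedly failed creation */
    return false;
}
```

A phantom segment starts out with neither file, so the first check
takes the last branch and marks it ready for archiving, even though
nobody ever intended to archive it.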

At that point we're in trouble. If any standbys of that promoted node
catch up after that fact (or new ones are created from older base
backups), after the phantom segment has been archived, and
restore_command is set, recovery will fail. The reason for that is that
one commonly will have recovery_target_timeline = latest (or the new
timeline) set. And XLogFileReadAnyTLI() is pretty simplistic. When
restoring a segment it'll simply probe all timelines, starting from the
newest. Which means that, once archived, our phantom segment will "hide"
the actual segment from the source timeline. Because it's not parseable
(it's at a different segment, thus parsing decides it's unusable),
recovery will hang at that point.
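
A rough model of that probing order, to show the "hiding" effect. This
is a sketch under simplifying assumptions, not the real
XLogFileReadAnyTLI(): the archive is an in-memory array, the types and
function names are invented, and 0 is used as a "not found" timeline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one archived file: timeline plus segment number. */
typedef struct ArchivedSeg
{
    uint32_t    tli;
    uint64_t    segno;
} ArchivedSeg;

/* Does the modelled archive contain a file for (tli, segno)? */
static int
archive_has(const ArchivedSeg *archive, size_t n, uint32_t tli, uint64_t segno)
{
    for (size_t i = 0; i < n; i++)
        if (archive[i].tli == tli && archive[i].segno == segno)
            return 1;
    return 0;
}

/*
 * Probe timelines newest-first and return the first one that has a file
 * for segno, or 0 if none does.  Deliberately no check of whether that
 * timeline was actually current at segno.
 */
static uint32_t
probe_any_tli(const ArchivedSeg *archive, size_t n,
              const uint32_t *tlis_newest_first, size_t ntlis,
              uint64_t segno)
{
    assert(archive != NULL && tlis_newest_first != NULL);
    for (size_t i = 0; i < ntlis; i++)
        if (archive_has(archive, n, tlis_newest_first[i], segno))
            return tlis_newest_first[i];
    return 0;
}
```

Before the phantom is archived, a lookup for the last segment of the
old timeline finds the old timeline's file. Once a file with the same
segment number exists on the new timeline, the same lookup returns the
new timeline instead, and the real segment is never tried.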

Which means quick standbys catch up, slow ones are "dead". It's
"fixable" by creating a restore_command which filters that phantom
segment, or deleting the segment from the archive.

The minimal fix here is presumably not to use XLByteToPrevSeg() in
RemoveXlogFile(), but XLByteToSeg(). I don't quite see what purpose it
serves here - I don't think it's ever needed. Normally it's harmless
because InstallXLogFileSegment() checks where it could install the file
to, but that doesn't work around timeline bumps, triggering the problem
at hand. This seems to be very longstanding behaviour, I'm not sure
where it's originating from (hard to track due to code movement).

There seems to be a larger question here though: Why does
XLogFileReadAnyTLI() probe all timelines even if they weren't a parent
at that period? That seems like a bad idea, especially in more
complicated scenarios where some precursor timeline might live for
longer than it was a parent. ISTM XLogFileReadAnyTLI() should check
which timeline a segment ought to come from, based on the history.
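
Roughly what I have in mind, as a sketch only. It is simplified to
segment granularity (so it ignores the segment in which a timeline
switch happens mid-segment), and HistEntry and tli_for_seg() are
invented for the illustration rather than taken from the tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical history entry: tli owns segments [begin_seg, end_seg). */
typedef struct HistEntry
{
    uint32_t    tli;
    uint64_t    begin_seg;
    uint64_t    end_seg;        /* UINT64_MAX for the still-open timeline */
} HistEntry;

/*
 * Return the one timeline that, according to the history, should supply
 * segno, or 0 if the history does not cover it.
 */
static uint32_t
tli_for_seg(const HistEntry *hist, size_t n, uint64_t segno)
{
    assert(hist != NULL);
    for (size_t i = 0; i < n; i++)
        if (segno >= hist[i].begin_seg && segno < hist[i].end_seg)
            return hist[i].tli;
    return 0;
}
```

With a switch from timeline 1 to 2 at segment 0xabcd, a request for
segment 0xabcc would only ever be answered from timeline 1, so a
phantom file on timeline 2 would simply not be looked at.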

Comments?

Greetings,

Andres Freund
