Quick Links

Re: [BUG] Archive recovery failure on 9.3+.

From:	Heikki Linnakangas <hlinnakangas(at)vmware(dot)com>
To:	Kyotaro HORIGUCHI <horiguchi(dot)kyotaro(at)lab(dot)ntt(dot)co(dot)jp>
Cc:	katsumata(dot)tomonari(at)po(dot)ntts(dot)co(dot)jp, pgsql-hackers(at)postgresql(dot)org
Subject:	Re: [BUG] Archive recovery failure on 9.3+.
Date:	2014-02-13 12:42:27
Message-ID:	52FCBDB3.30206@vmware.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

On 02/13/2014 01:37 PM, Kyotaro HORIGUCHI wrote:
> At Thu, 13 Feb 2014 10:11:22 +0200, Heikki Linnakangas <hlinnakangas(at)vmware(dot)com> wrote in <52FC7E2A(dot)9060703(at)vmware(dot)com>
>> On 02/13/2014 08:44 AM, Kyotaro HORIGUCHI wrote:
>>>>>>> Wouldn't it be better to not archive the old segment, and instead
>>>>>>> switch
>>>>>>> to a new segment after writing the end-of-recovery checkpoint, so that
>>>>>>> the segment on the new timeline is archived sooner?
>>>>>>
>>>>>> It would be better to zero-fill and switch segments, yes. We should
>>>>>> NEVER be in a position of archiving two different versions of the same
>>>>>> segment.
>>>>>
>>>>> Ok, I think we're in agreement that that's the way to go for master.
>
> Does this mean that no mechanical solution for this situation
> will be given and an operator should remove the older segment for
> the same segment id before starting recovoery process?

I don't understand. With the proposed change to never archive a partial
segment, there is no segment with the same filename in the archive.
There is nothing to remove. Or did you mean that you have to do that in
9.0-9.2 back-branches, if we only back-patch this change to 9.3?

>> If it's already in the archive, it's not going to be removed from the
>> archive.
>
> I have understood the conclusion so far is not archiving the
> older segment when promotion but it seems a bit odd story as you
> suggested. If the conclusion here is no aid as my new
> understanding, would you let me hear the reason why recovery have
> changed to prefer archive to pg_xlog?
>
> The commit abf5c5c9a4 seems to change the behavior but I don't
> find the reason for the change.

Well, it didn't work too well before abf5c5c9a4 either. With your test
script, and commit abf5c5c9a4 reverted [1], you got this:

### STAGE 3/3: 2/2 START SERVER IN ARCHIVE RECOVERY MODE 2ND RUN ###
waiting for server to start....LOG: database system was interrupted;
last known up at 2014-02-13 14:17:53 EET
LOG: starting archive recovery
LOG: restored log file "00000002.history" from archive
/bin/cp: cannot stat
‘/home/heikki/pgsql.93stable/test/arc/000000020000000000000002’: No such
file or directory
LOG: restored log file "000000010000000000000002" from archive
LOG: record with zero length at 0/2000090
LOG: invalid primary checkpoint record
/bin/cp: cannot stat
‘/home/heikki/pgsql.93stable/test/arc/000000020000000000000002’: No such
file or directory
LOG: restored log file "000000010000000000000002" from archive
LOG: using previous checkpoint record at 0/2000028
LOG: consistent recovery state reached at 0/2000028
LOG: record with zero length at 0/2000090
LOG: redo is not required
/bin/cp: cannot stat
‘/home/heikki/pgsql.93stable/test/arc/000000020000000000000002’: No such
file or directory
LOG: restored log file "000000010000000000000002" from archive
/bin/cp: cannot stat
‘/home/heikki/pgsql.93stable/test/arc/00000003.history’: No such file or
directory
LOG: selected new timeline ID: 3
LOG: restored log file "00000002.history" from archive
LOG: archive recovery complete
LOG: database system is ready to accept connections
LOG: autovacuum launcher started

Note the "invalid primary checkpoint record" message. It restored the
partial 000000010000000000000002 file from the archive, and tried to
recover using that instead of the 000000020000000000000002 file from
pg_xlog. The primary checkpoint record is not present in
000000010000000000000002, so it fell back to recover from the previous
checkpoint.

It was the exact same issue, even though it appeared to work if you
don't look carefully. The second recovery never replayed the WAL
generated on timeline 2, it branched off directly from timeline 1, at
the same point where timeline 2 did. You can see that by looking at the
history files. Also, if you create a table and insert a row to it after
the first recovery, just before killing the server, you'll see that the
row is not there after the second recovery. That's wrong; the data files
might already contain some changes from timeline 2, and you end up with
a corrupt database. Thanks to commit abf5c5c9a4, the situation is at
least detected, and the system refuses to start up.

The behavior where we prefer a segment from archive with lower TLI over
a file with higher TLI in pg_xlog actually changed in commit
a068c391ab0. Arguably changing it wasn't a good idea, but the problem
your test script demonstrates can be fixed by not archiving the partial
segment, with no change to the preference of archive/pg_xlog. As
discussed, archiving a partial segment seems like a bad idea anyway, so
let's just stop doing that.

[1] just applying commit abf5c5c9a4 in reverse doesn't apply cleanly due
to further changes to the code, but I did this to simulate the old behavior:

diff --git a/src/backend/access/transam/xlog.c
b/src/backend/access/transam/xlog.c
index 6a4a504..352d3f7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4366,6 +4366,7 @@ readRecoveryCommandFile(void)

/* Enable fetching from archive recovery area */
ArchiveRecoveryRequested = true;
+ InArchiveRecovery = true;

/*
* If user specified recovery_target_timeline, validate it or compute the

- Heikki

In response to

Re: [BUG] Archive recovery failure on 9.3+. at 2014-02-13 11:37:06 from Kyotaro HORIGUCHI

Responses

Re: [BUG] Archive recovery failure on 9.3+. at 2014-02-13 16:47:54 from Heikki Linnakangas

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Christoph Berg	2014-02-13 12:42:45	Re: [BUG] Archive recovery failure on 9.3+.
Previous Message	Kohei KaiGai	2014-02-13 12:28:34	Re: New hook after raw parsing, before analyze