Quick Links

archive recovery fetching wrong segments

From:	Grigory Smolkin <g(dot)smolkin(at)postgrespro(dot)ru>
To:	pgsql-hackers(at)postgresql(dot)org
Subject:	archive recovery fetching wrong segments
Date:	2020-04-06 00:02:37
Message-ID:	dd6690b0-ec03-6b3c-6fac-c963f91f87a7@postgrespro.ru
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-hackers

Hello, hackers!

I`m investigating a complains from our clients about archive recovery
speed been very slow, and I`ve noticed a really strange and, I think, a
very dangerous recovery behavior.

When running multi-timeline archive recovery, for every requested segno
startup process iterates through every timeline in restore target
timeline history, starting from highest timeline and ending in current,
and tries to fetch the segno in question from this timeline.

Consider the following example.
Timelines:
ARCHIVE INSTANCE 'node'
================================================================================================================================

TLI Parent TLI Switchpoint Min Segno Max
Segno N segments Size Zratio N backups Status
================================================================================================================================

3    2           0/AEFFEDE0   0000000300000000000000AE
0000000300000000000000D5 40          41MB   15.47   0 OK
2    1           0/A08768D0   0000000200000000000000A0
0000000200000000000000AE 15          14MB   17.24   0 OK
1    0           0/0          000000010000000000000001
0000000100000000000000BB 187         159MB 18.77   1          OK

Backup:
================================================================================================================================

Instance Version ID Recovery Time Mode WAL Mode
TLI Time Data WAL Zratio Start LSN Stop LSN Status
================================================================================================================================

node 11 Q8C8IH 2020-04-06 02:13:31+03 FULL ARCHIVE 1/0
3s 23MB 16MB 1.00 0/2000028 0/30000B8 OK

So when we are trying to restore this backup, located on Timeline 1, to
the restore target on Timeline 3, we are getting this in the PostgreSQL
log:
....
2020-04-05 23:24:36 GMT [28508]: [5-1] LOG: restored log file
"00000003.history" from archive
INFO: PID [28511]: pg_probackup archive-get WAL file:
000000030000000000000002, remote: none, threads: 1/1, batch: 20
ERROR: PID [28511]: pg_probackup archive-get failed to deliver WAL file
000000030000000000000002, prefetched: 0/20, time elapsed: 0ms
INFO: PID [28512]: pg_probackup archive-get WAL file:
000000020000000000000002, remote: none, threads: 1/1, batch: 20
ERROR: PID [28512]: pg_probackup archive-get failed to deliver WAL file
000000020000000000000002, prefetched: 0/20, time elapsed: 0ms
INFO: PID [28513]: pg_probackup archive-get WAL file:
000000010000000000000002, remote: none, threads: 1/1, batch: 20
INFO: PID [28513]: pg_probackup archive-get copied WAL file
000000010000000000000002
2020-04-05 23:24:36 GMT [28508]: [6-1] LOG: restored log file
"000000010000000000000002" from archive
...

Before requesting 000000010000000000000002 recovery tries to fetch
000000030000000000000002 and 000000020000000000000002 and that goes for
every segment, restored from the archive.
This tremendously slows down recovery speed, especially if archive is
located on remote machine with high latency network.
And it also may lead to feeding recovery with wrong WAL segment, located
on the next timeline.

Is there a reason behind this behavior?

Also I`ve attached a patch, which fixed this issue for me, but I`m not
sure, that chosen approach is sound and didn`t break something.

--
Grigory Smolkin
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment	Content-Type	Size
blind_recovery_fix.patch	text/x-patch	1.2 KB

Responses

Re: archive recovery fetching wrong segments at 2020-04-06 18:17:52 from David Steele

Browse pgsql-hackers by date

	From	Date	Subject
Next Message	Andres Freund	2020-04-06 00:07:15	Re: CLOBBER_CACHE_ALWAYS regression instability
Previous Message	Alvaro Herrera	2020-04-05 23:54:19	Re: CLOBBER_CACHE_ALWAYS regression instability