point-in-time-recovery issue when applying WAL files, "invalid magic number 0000 in log file" error

From: Radovan Jablonovsky <radovan(dot)jablonovsky(at)replicon(dot)com>
To: pgsql-admin <pgsql-admin(at)postgresql(dot)org>
Subject: point-in-time-recovery issue when applying WAL files, "invalid magic number 0000 in log file" error
Date: 2014-06-19 20:08:12
Message-ID: CAJYcdTv9uLPejVdiL4hUmmAUL50jL2MkkMkp0qKAoc1_nfCWzg@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-admin

Hello,

We have issue with point-in-time recovery. When applying WAL files on
restored postgresql server (we used pg_basebackup to create full cluster
backup) we got "invalid magic number 0000 in log file" error. The recovery
process applied WAL files from archive location and it was set to specific
point in time.

In recovery.conf file were parameters set to:
restore_command = 'cp /db/archive/transaction_logs/%f %p'
recovery_target_time = '2014-06-11 16:45:00'

The recovery stopped at
2014-06-16 15:40:30.514 MDT [32021]: [16580-1] LOG: restored log file
"0000001A00001172000000D1" from archive
2014-06-16 15:40:35.957 MDT [32021]: [16581-1] LOG: restored log file
"0000001A00001172000000D2" from archive
2014-06-16 15:40:46.457 MDT [32021]: [16582-1] LOG: invalid magic number
0000 in log file 4466, segment 210, offset 6594560
2014-06-16 15:40:46.457 MDT [32021]: [16583-1] LOG: redo done at
1172/D2649068
2014-06-16 15:40:46.457 MDT [32021]: [16584-1] LOG: last completed
transaction was at log time 2014-06-11 12:34:13.496717-06
2014-06-16 15:40:46.481 MDT [32021]: [16585-1] LOG: restored log file
"0000001A00001172000000D2" from archive
cp: cannot stat `/db/archive/transaction_logs/0000001B.history': No such
file or directory
2014-06-16 15:40:46.499 MDT [32021]: [16586-1] LOG: selected new timeline
ID: 27
cp: cannot stat `/db/archive/transaction_logs/0000001A.history': No such
file or directory
2014-06-16 15:40:49.766 MDT [32021]: [16587-1] LOG: archive recovery
complete
2014-06-16 15:55:16.171 MDT [6791]: [1-1] LOG: autovacuum launcher started
2014-06-16 15:55:16.172 MDT [32019]: [1-1] LOG: database system is ready
to accept connections

What is interesting, the postgresql server streaming replica/slave, which
is set to asynchronous replication had no issue. When we checked the log
from replica/slave there was no issue with applying the WAL stream on slave.

Streaming replica/slave log:
2014-06-11 12:20:40.832 MDT [11787] - [] - [] - []: [26862-1] LOG:
restartpoint complete: wrote 77860 buffers (7.4%); 0 transaction log
file(s) added, 0 removed, 23 recycled; write=719.761 s, sync=0.983 s,
total=720.748 s; sync files=5119, longest=0.261 s, average=0.000 s
2014-06-11 12:20:40.832 MDT [11787] - [] - [] - []: [26863-1] LOG:
recovery restart point at 1172/6C678380
2014-06-11 12:20:40.832 MDT [11787] - [] - [] - []: [26864-1] DETAIL: last
completed transaction was at log time 2014-06-11 12:20:45.928542-06
2014-06-11 12:23:40.170 MDT [11787] - [] - [] - []: [26865-1] LOG:
restartpoint starting: time
2014-06-11 12:35:43.936 MDT [11787] - [] - [] - []: [26866-1] LOG:
restartpoint complete: wrote 110827 buffers (10.6%); 0 transaction log
file(s) added, 0 removed, 30 recycled; write=719.931 s, sync=3.832 s,
total=723.766 s; sync files=6957, longest=1.349 s, average=0.000 s
2014-06-11 12:35:43.936 MDT [11787] - [] - [] - []: [26867-1] LOG:
recovery restart point at 1172/88B75B48
2014-06-11 12:35:43.936 MDT [11787] - [] - [] - []: [26868-1] DETAIL: last
completed transaction was at log time 2014-06-11 12:35:49.145232-06
2014-06-11 12:35:44.095 MDT [11787] - [] - [] - []: [26869-1] LOG:
restartpoint starting: xlog
2014-06-11 12:47:05.041 MDT [11787] - [] - [] - []: [26870-1] LOG:
restartpoint complete: wrote 176758 buffers (16.9%); 0 transaction log
file(s) added, 0 removed, 28 recycled; write=670.320 s, sync=10.588 s,
total=680.945 s; sync files=7398, longest=3.899 s, average=0.001 s
2014-06-11 12:47:05.041 MDT [11787] - [] - [] - []: [26871-1] LOG:
recovery restart point at 1172/BE5F4928
2014-06-11 12:47:05.041 MDT [11787] - [] - [] - []: [26872-1] DETAIL: last
completed transaction was at log time 2014-06-11 12:47:10.216659-06
2014-06-11 12:47:05.224 MDT [11787] - [] - [] - []: [26873-1] LOG:
restartpoint starting: xlog
2014-06-11 12:58:27.067 MDT [11787] - [] - [] - []: [26874-1] LOG:
restartpoint complete: wrote 495857 buffers (47.3%); 0 transaction log
file(s) added, 0 removed, 54 recycled; write=670.682 s, sync=11.148 s,
total=681.843 s; sync files=5627, longest=4.773 s, average=0.001 s

What could cause this issue? I suspect one possibility could be the WAL
file itself was corrupted when it was archived.
Is it possible to recover/fix WAL file?

--

*Radovan Jablonovsky* | SaaS DBA | Phone 1-403-262-6519 (ext. 256) | Fax
1-403-233-8046

* Replicon | Hassle-Free Time & Expense Management Software - 7,300
Customers - 70 Countrieswww.replicon.com
<http://www.replicon.com/> | facebook
<http://www.facebook.com/Replicon.inc> | twitter
<http://twitter.com/Replicon> | blog
<http://www.replicon.com/blog/> | contact us
<http://www.replicon.com/about_replicon/contact_us.aspx> We are
hiring! | search jobs
<http://tbe.taleo.net/NA2/ats/careers/searchResults.jsp?org=REPLICON&cws=1&act=sort&sortColumn=1&__utma=1.651918544.1299001662.1299170819.1299174966.10&__utmb=1.8.10.1299174966&__utmc=1&__utmx=-&__utmz=1.1299174985.10.3.utmcsr=google%7Cutmccn=(organic)%7Cutmcmd=organic%7Cutmctr=replicon%20careers&__utmv=1.%7C3=Visitor%20Type=Prospects=1,&__utmk=40578466>
*

Browse pgsql-admin by date

  From Date Subject
Next Message Oliver 2014-06-20 11:51:30 Postgresql not getting assigned memory
Previous Message Oliver 2014-06-19 10:44:16 Re: Best backup strategy for production systems