Re: Problem with restoring from backup on 9.0.2

From: Thom Brown <thom(at)linux(dot)com>
To: depesz(at)depesz(dot)com
Cc: pgsql-general(at)postgresql(dot)org
Subject: Re: Problem with restoring from backup on 9.0.2
Date: 2011-01-27 22:23:52
Message-ID: AANLkTimpdusPx4Dvu0C8BBcqBfSMARnEE07_3N8J-WXA@mail.gmail.com
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

On 27 December 2010 18:24, hubert depesz lubaczewski <depesz(at)depesz(dot)com> wrote:
> hi,
> this mail will be relatively long, because I need to explain in details
> what/how I do.
>
> Simple summary:
>
> when restoring from hot backup, with out xlogs in pg_xlog/, but instead
> using recovery.conf to get xlogs from wal archive, I get "WAL ends
> before consistent recovery point" in case it shouldn't happen.
>
> Details:
>
> I have following setup:
>
> master pg is sending wal segments to 2 places:
>  - /shared/walarchive - long term storage, compressed content
>  - /shared/4backups   - short term storage, just to do hot backups on slave, not compressed
>
> for sending I use omnipitr-archive, with 2 "-dr" options
>
> slave pg is configured to read wal segments from /shared/4backups (using `cp -i
> ...`) *and* then to switch to streaming replication.
>
> additionally, on slave there is archive cleanup program, which cleans
> /shared/4backups but only if /flags/backup file is not present (which is
> only present when backup happens).
>
> to make backups I use omnipitr-backup-slave, which does:
>
> 1. creates /flags/backup file
> 2. gets snapshot of pg_controldata
> 3. makes backup_label in $PGDATA, where START WAL LOCATION is taken from
> "Latest checkpoint's REDO location" from step #2, and "CHECKPOINT
> LOCATION:" comes from "Latest checkpoint location" in #2
> 4. compresses whole $PGDATA
> 5. gets snapshot of pg_controldata
> 6. waits till new snapshot of pg_controldata (makes another one every 5
> seconds) will have different (from step #4) value in "Latest checkpoint
> location"
> 7. makes .backup file in /shared/4backups that contains the same
> information as #3, but with "STOP WAL LOCATION" taken from "Latest
> checkpoint location" from #6
> 8. compresses /shared/4backups to secondary backup file
>
> afterwards I get 2 files - data.tar.gz and xlog.tar.gz.
>
> Now. I got those files, moved them to 3rd server, uncompressed.
>
> Afterwards I got PGDATA directory, with xlogs in pg_xlog, and no
> recovery.conf.
>
> When starting from this backup - pg starts, reads all necessary segments
> from pg_xlog, finished recovery, and starts without any problem as
> standalone.
>
> but, when i, after extraction, before starting - remove all xlog files,
> and move .backup file from pg_xlog to /shared/walarchive, and setup
> recovery.conf using normal omnipitr-restore, to get wal segments from
> /shared/walarchive - pg starts to behave weirdly.
>
> It starts. Reads all segments from the first mentioned in backup_label
> to the latest available in /shared/walarchvie, and then it fails with
> "WAL ends before consistent recovery point" - but this is *well* past
> the point that it should know that the state is consistent!
>
> Files content from my situation:
>
> =$ cat backup_label.old
> START WAL LOCATION: 33/2213B200 (file 000000010000003300000022)
> CHECKPOINT LOCATION: 33/23101410
> START TIME: 2010-12-27 11:53:44 GMT
> LABEL: OmniPITR_Slave_Hot_Backup
>
> (the file got already renamed to .old by pg recovery process)
>
> =$ zcat /mnt/db/prod/walarchive/000000010000003300000022.0013B200.backup.gz
> START WAL LOCATION: 33/2213B200 (file 000000010000003300000022)
> STOP WAL LOCATION: 33/2C24C800 (file 00000001000000330000002C)
> CHECKPOINT LOCATION: 33/23101410
> START TIME: 2010-12-27 11:53:44 GMT
> START TIME: 2010-12-27 12:06:17 GMT
> LABEL: OmniPITR_Slave_Hot_Backup
>
> last 20 lines from log;
>
> =$ tail -n 20 pg_log/postgresql-2010-12-27_172759.log
> 2010-12-27 17:44:16 UTC  [15768]: [1934-1] LOG:  restored log file "000000010000003A000000B1" from archive
> 2010-12-27 17:44:16 UTC  [15768]: [1935-1] LOG:  restored log file "000000010000003A000000B2" from archive
> 2010-12-27 17:44:16 UTC  [15768]: [1936-1] LOG:  restored log file "000000010000003A000000B3" from archive
> 2010-12-27 17:44:17 UTC  [15768]: [1937-1] LOG:  restored log file "000000010000003A000000B4" from archive
> 2010-12-27 17:44:17 UTC  [15768]: [1938-1] LOG:  restored log file "000000010000003A000000B5" from archive
> 2010-12-27 17:44:18 UTC  [15768]: [1939-1] LOG:  restored log file "000000010000003A000000B6" from archive
> 2010-12-27 17:44:18 UTC  [15768]: [1940-1] LOG:  restored log file "000000010000003A000000B7" from archive
> 2010-12-27 17:44:19 UTC  [15768]: [1941-1] LOG:  restored log file "000000010000003A000000B8" from archive
> 2010-12-27 17:44:19 UTC  [15768]: [1942-1] LOG:  restored log file "000000010000003A000000B9" from archive
> 2010-12-27 17:44:20 UTC  [15768]: [1943-1] LOG:  restored log file "000000010000003A000000BA" from archive
> 2010-12-27 17:44:20 UTC  [15768]: [1944-1] LOG:  restored log file "000000010000003A000000BB" from archive
> 2010-12-27 17:44:21 UTC  [15768]: [1945-1] LOG:  restored log file "000000010000003A000000BC" from archive
> 2010-12-27 17:44:21 UTC  [15768]: [1946-1] LOG:  restored log file "000000010000003A000000BD" from archive
> 2010-12-27 17:44:23 UTC  [15768]: [1947-1] LOG:  could not open file "pg_xlog/000000010000003A000000BE" (log file 58, segment 190): No such file or directory
> 2010-12-27 17:44:23 UTC  [15768]: [1948-1] LOG:  redo done at 3A/BDFFF6E8
> 2010-12-27 17:44:23 UTC  [15768]: [1949-1] LOG:  last completed transaction was at log time 2010-12-27 17:43:23.558028+00
> 2010-12-27 17:44:23 UTC  [15768]: [1950-1] LOG:  restored log file "000000010000003A000000BD" from archive
> 2010-12-27 17:44:23 UTC  [15768]: [1951-1] FATAL:  WAL ends before consistent recovery point
> 2010-12-27 17:44:23 UTC  [15765]: [1-1] LOG:  startup process (PID 15768) exited with exit code 1
> 2010-12-27 17:44:23 UTC  [15765]: [2-1] LOG:  terminating any other active server processes
>
> First 50 lines from log:
>
> =$ head -n 50 pg_log/postgresql-2010-12-27_172759.log
> 2010-12-27 17:27:59 UTC  [15768]: [1-1] LOG:  database system was interrupted while in recovery at log time 2010-12-27 11:54:55 UTC
> 2010-12-27 17:27:59 UTC  [15768]: [2-1] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
> 2010-12-27 17:27:59 UTC  [15768]: [3-1] LOG:  starting archive recovery
> 2010-12-27 17:28:00 UTC  [15768]: [4-1] LOG:  restored log file "000000010000003300000023" from archive
> 2010-12-27 17:28:00 UTC  [15768]: [5-1] LOG:  restored log file "000000010000003300000022" from archive
> 2010-12-27 17:28:00 UTC  [15768]: [6-1] LOG:  redo starts at 33/2213B200
> 2010-12-27 17:28:01 UTC  [15768]: [7-1] LOG:  restored log file "000000010000003300000023" from archive
> 2010-12-27 17:28:01 UTC  [15768]: [8-1] LOG:  restored log file "000000010000003300000024" from archive
> 2010-12-27 17:28:01 UTC  [15768]: [9-1] LOG:  restored log file "000000010000003300000025" from archive
> 2010-12-27 17:28:02 UTC  [15768]: [10-1] LOG:  restored log file "000000010000003300000026" from archive
> 2010-12-27 17:28:02 UTC  [15768]: [11-1] LOG:  restored log file "000000010000003300000027" from archive
> 2010-12-27 17:28:03 UTC  [15768]: [12-1] LOG:  restored log file "000000010000003300000028" from archive
> 2010-12-27 17:28:03 UTC  [15768]: [13-1] LOG:  restored log file "000000010000003300000029" from archive
> 2010-12-27 17:28:03 UTC  [15768]: [14-1] LOG:  restored log file "00000001000000330000002A" from archive
> 2010-12-27 17:28:04 UTC  [15768]: [15-1] LOG:  restored log file "00000001000000330000002B" from archive
> 2010-12-27 17:28:04 UTC  [15768]: [16-1] LOG:  restored log file "00000001000000330000002C" from archive
> 2010-12-27 17:28:04 UTC  [15768]: [17-1] LOG:  restored log file "00000001000000330000002D" from archive
> 2010-12-27 17:28:05 UTC  [15768]: [18-1] LOG:  restored log file "00000001000000330000002E" from archive
> 2010-12-27 17:28:05 UTC  [15768]: [19-1] LOG:  restored log file "00000001000000330000002F" from archive
> 2010-12-27 17:28:06 UTC  [15768]: [20-1] LOG:  restored log file "000000010000003300000030" from archive
> 2010-12-27 17:28:06 UTC  [15768]: [21-1] LOG:  restored log file "000000010000003300000031" from archive
> 2010-12-27 17:28:07 UTC  [15768]: [22-1] LOG:  restored log file "000000010000003300000032" from archive
> 2010-12-27 17:28:07 UTC  [15768]: [23-1] LOG:  restored log file "000000010000003300000033" from archive
> 2010-12-27 17:28:08 UTC  [15768]: [24-1] LOG:  restored log file "000000010000003300000034" from archive
> 2010-12-27 17:28:08 UTC  [15768]: [25-1] LOG:  restored log file "000000010000003300000035" from archive
> 2010-12-27 17:28:08 UTC  [15768]: [26-1] LOG:  restored log file "000000010000003300000036" from archive
> 2010-12-27 17:28:09 UTC  [15768]: [27-1] LOG:  restored log file "000000010000003300000037" from archive
> 2010-12-27 17:28:09 UTC  [15768]: [28-1] LOG:  restored log file "000000010000003300000038" from archive
> 2010-12-27 17:28:10 UTC  [15768]: [29-1] LOG:  restored log file "000000010000003300000039" from archive
> 2010-12-27 17:28:10 UTC  [15768]: [30-1] LOG:  restored log file "00000001000000330000003A" from archive
> 2010-12-27 17:28:11 UTC  [15768]: [31-1] LOG:  restored log file "00000001000000330000003B" from archive
> 2010-12-27 17:28:11 UTC  [15768]: [32-1] LOG:  restored log file "00000001000000330000003C" from archive
> 2010-12-27 17:28:12 UTC  [15768]: [33-1] LOG:  restored log file "00000001000000330000003D" from archive
> 2010-12-27 17:28:12 UTC  [15768]: [34-1] LOG:  restored log file "00000001000000330000003E" from archive
> 2010-12-27 17:28:12 UTC  [15768]: [35-1] LOG:  restored log file "00000001000000330000003F" from archive
> 2010-12-27 17:28:13 UTC  [15768]: [36-1] LOG:  restored log file "000000010000003300000040" from archive
> 2010-12-27 17:28:13 UTC  [15768]: [37-1] LOG:  restored log file "000000010000003300000041" from archive
> 2010-12-27 17:28:14 UTC  [15768]: [38-1] LOG:  restored log file "000000010000003300000042" from archive
> 2010-12-27 17:28:14 UTC  [15768]: [39-1] LOG:  restored log file "000000010000003300000043" from archive
> 2010-12-27 17:28:14 UTC  [15768]: [40-1] LOG:  restored log file "000000010000003300000044" from archive
> 2010-12-27 17:28:15 UTC  [15768]: [41-1] LOG:  restored log file "000000010000003300000045" from archive
> 2010-12-27 17:28:15 UTC  [15768]: [42-1] LOG:  restored log file "000000010000003300000046" from archive
> 2010-12-27 17:28:15 UTC  [15768]: [43-1] LOG:  restored log file "000000010000003300000047" from archive
> 2010-12-27 17:28:16 UTC  [15768]: [44-1] LOG:  restored log file "000000010000003300000048" from archive
> 2010-12-27 17:28:16 UTC  [15768]: [45-1] LOG:  restored log file "000000010000003300000049" from archive
> 2010-12-27 17:28:16 UTC  [15768]: [46-1] LOG:  restored log file "00000001000000330000004A" from archive
> 2010-12-27 17:28:17 UTC  [15768]: [47-1] LOG:  restored log file "00000001000000330000004B" from archive
> 2010-12-27 17:28:17 UTC  [15768]: [48-1] LOG:  restored log file "00000001000000330000004C" from archive
> 2010-12-27 17:28:18 UTC  [15768]: [49-1] LOG:  restored log file "00000001000000330000004D" from archive
> 2010-12-27 17:28:18 UTC  [15768]: [50-1] LOG:  restored log file "00000001000000330000004E" from archive
>
>
> pg_controldata:
>
> =$ /usr/pgsql-9.0/bin/pg_controldata .
> pg_control version number:            903
> Catalog version number:               201008051
> Database system identifier:           5554952960675921391
> Database cluster state:               in archive recovery
> pg_control last modified:             Mon 27 Dec 2010 05:43:36 PM GMT
> Latest checkpoint location:           3A/6377BB8
> Prior checkpoint location:            37/7D93A540
> Latest checkpoint's REDO location:    39/EB0350B8
> Latest checkpoint's TimeLineID:       1
> Latest checkpoint's NextXID:          0/1359777
> Latest checkpoint's NextOID:          294922
> Latest checkpoint's NextMultiXactId:  10752
> Latest checkpoint's NextMultiOffset:  30580
> Latest checkpoint's oldestXID:        654
> Latest checkpoint's oldestXID's DB:   1
> Latest checkpoint's oldestActiveXID:  1306958
> Time of latest checkpoint:            Mon 27 Dec 2010 04:50:53 PM GMT
> Minimum recovery ending location:     3A/B7895958
> Backup start location:                33/2213B200
> Current wal_level setting:            hot_standby
> Current max_connections setting:      1024
> Current max_prepared_xacts setting:   10
> Current max_locks_per_xact setting:   64
> Maximum data alignment:               8
> Database block size:                  8192
> Blocks per segment of large relation: 131072
> WAL block size:                       8192
> Bytes per WAL segment:                16777216
> Maximum length of identifiers:        64
> Maximum columns in an index:          32
> Maximum size of a TOAST chunk:        1996
> Date/time type storage:               64-bit integers
> Float4 argument passing:              by value
> Float8 argument passing:              by value
>
>
> Lines from logs that are *not* about recovered log file:
>
> =$ grep -nvE 'LOG:  restored log file' pg_log/postgresql-2010-12-27_172759.log
> 1:2010-12-27 17:27:59 UTC  [15768]: [1-1] LOG:  database system was interrupted while in recovery at log time 2010-12-27 11:54:55 UTC
> 2:2010-12-27 17:27:59 UTC  [15768]: [2-1] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
> 3:2010-12-27 17:27:59 UTC  [15768]: [3-1] LOG:  starting archive recovery
> 6:2010-12-27 17:28:00 UTC  [15768]: [6-1] LOG:  redo starts at 33/2213B200
> 581:2010-12-27 17:33:00 UTC  [15781]: [1-1] LOG:  restartpoint starting: time
> 666:2010-12-27 17:33:47 UTC  [15781]: [2-1] LOG:  restartpoint complete: wrote 9775 buffers (14.9%); write=38.593 s, sync=5.409 s, total=46.922 s
> 667:2010-12-27 17:33:47 UTC  [15781]: [3-1] LOG:  recovery restart point at 35/13010BE8
> 668:2010-12-27 17:33:47 UTC  [15781]: [4-1] DETAIL:  last completed transaction was at log time 2010-12-27 16:00:12.119251+00
> 1172:2010-12-27 17:38:00 UTC  [15781]: [5-1] LOG:  restartpoint starting: time
> 1276:2010-12-27 17:38:47 UTC  [15781]: [6-1] LOG:  restartpoint complete: wrote 10779 buffers (16.4%); write=44.044 s, sync=2.710 s, total=47.082 s
> 1277:2010-12-27 17:38:47 UTC  [15781]: [7-1] LOG:  recovery restart point at 37/5F01F358
> 1278:2010-12-27 17:38:47 UTC  [15781]: [8-1] DETAIL:  last completed transaction was at log time 2010-12-27 16:32:19.637622+00
> 1787:2010-12-27 17:43:00 UTC  [15781]: [9-1] LOG:  restartpoint starting: time
> 1859:2010-12-27 17:43:36 UTC  [15781]: [10-1] LOG:  restartpoint complete: wrote 6314 buffers (9.6%); write=24.869 s, sync=9.906 s, total=36.024 s
> 1860:2010-12-27 17:43:36 UTC  [15781]: [11-1] LOG:  recovery restart point at 39/EB0350B8
> 1861:2010-12-27 17:43:36 UTC  [15781]: [12-1] DETAIL:  last completed transaction was at log time 2010-12-27 17:07:35.913562+00
> 1959:2010-12-27 17:44:23 UTC  [15768]: [1947-1] LOG:  could not open file "pg_xlog/000000010000003A000000BE" (log file 58, segment 190): No such file or directory
> 1960:2010-12-27 17:44:23 UTC  [15768]: [1948-1] LOG:  redo done at 3A/BDFFF6E8
> 1961:2010-12-27 17:44:23 UTC  [15768]: [1949-1] LOG:  last completed transaction was at log time 2010-12-27 17:43:23.558028+00
> 1963:2010-12-27 17:44:23 UTC  [15768]: [1951-1] FATAL:  WAL ends before consistent recovery point
> 1964:2010-12-27 17:44:23 UTC  [15765]: [1-1] LOG:  startup process (PID 15768) exited with exit code 1
> 1965:2010-12-27 17:44:23 UTC  [15765]: [2-1] LOG:  terminating any other active server processes
>
>
> Since pg on its own starts from this backup, and we do use
> omnipitr-backup-slave on many hosts, and there was never problem with it
> - i'm reluctant to assume the bug is in there.
>
> so, it leaves out pebkac or some bug in pg?
>
> Best regards,
>
> depesz

Depesz, did you ever resolve this?

--
Thom Brown
Twitter: @darkixion
IRC (freenode): dark_ixion
Registered Linux user: #516935

In response to

Responses

Browse pgsql-general by date

  From Date Subject
Next Message Jorge Arévalo 2011-01-28 11:16:11 Executing SQL expression from C-functions
Previous Message Lawrence Cohan 2011-01-27 22:02:39 Re: Adding ddl audit trigger