From: | Lee Azzarello <lee(at)dropio(dot)com> |
---|---|
To: | pgsql-admin(at)postgresql(dot)org |
Subject: | Warm standby recovery failure |
Date: | 2009-01-29 19:47:14 |
Message-ID: | 16b031c0901291147m17c9e176occ73994d2706f454@mail.gmail.com |
Views: | Raw Message | Whole Thread | Download mbox | Resend email |
Thread: | |
Lists: | pgsql-admin |
I'm new to the list. So hello everyone. This morning my monitoring
system sent me an alert that my warm standby database had exited
recovery mode. I checked on it when I woke up and sure enough, it had
stopped looking for new WAL files and became available for queries.
After some digging, I uncovered these details, in a chronological
order:
*NAGIOS*
Date/Time: Thu Jan 29 12:36:04 UTC 2009
DISK CRITICAL - free space: /var/lib/postgresql 35 MB (0% inode=99%):
** MYSTERY PERIOD **
(two crontabs to do WAL maintance are scheduled for 04:00, not
applicable to this situation)
*NAGIOS*
Date/Time: Thu Jan 29 12:49:05 UTC 2009
DISK OK - free space: /var/lib/postgresql 14129 MB (29% inode=99%)
* me working in a shell on the host *
# date
Thu Jan 29 15:33:35 UTC 2009
# df -h /dev/sdc1
Filesystem Size Used Avail Use% Mounted on
/dev/sdc1 50G 3.3G 44G 7% /var/lib/postgresql
/var/log/postgresql/postgresql-8.3-main.log
2009-01-29 12:41:36 UTC LOG: restored log file
"000000010000002F000000A9" from archive
size= [16777216]
copy /mnt/pg_xlog/000000010000002F000000AA pg_xlog/RECOVERYXLOG
`/mnt/pg_xlog/000000010000002F000000AA' -> `pg_xlog/RECOVERYXLOG'
`/mnt/pg_xlog/000000010000002F000000AA' -> `pg_xlog/./000000010000002F000000AA'
cp: writing `pg_xlog/./000000010000002F000000AA': No space left on device
2009-01-29 12:48:14 UTC LOG: could not read from log file 47, segment
170, offset 3129344: Success
2009-01-29 12:48:14 UTC LOG: redo done at 2F/AA2FBE08
2009-01-29 12:48:14 UTC LOG: last completed transaction was at log
time 2009-01-29 12:42:50.572406+00
size= [16777216]
copy /mnt/pg_xlog/000000010000002F000000AA pg_xlog/RECOVERYXLOG
`/mnt/pg_xlog/000000010000002F000000AA' -> `pg_xlog/RECOVERYXLOG'
cp: overwrite `pg_xlog/./000000010000002F000000AA'? 2009-01-29
12:48:14 UTC LOG: restored log file "000000010000002F000000AA" from
archive
pg_xlog/RECOVERYHISTORY
2009-01-29 12:48:14 UTC LOG: selected new timeline ID: 2
pg_xlog/RECOVERYHISTORY
2009-01-29 12:48:14 UTC LOG: archive recovery complete
# /etc/init.d/postgresql-8.3 start
...
# ps ax | grep postgres
3499 ? Ss 0:00 postgres: startup process waiting for
000000020000002F000000AA
3503 ? S 0:00 /bin/bash
/var/lib/postgresql/bin/recover.sh pg_xlog/RECOVERYXLOG
000000020000002F000000AA
WTF? (before I knew about timelines)
# ls -l /mnt/pg_xlog/000000020000002F000000AA
ls: cannot access /mnt/pg_xlog/000000020000002F000000AA: No such file
or directory
# updatedb
# locate 000000020000002F000000AA
(nothing)
# rm /var/lib/postgresql/8.3/main/pg_xlog/0*
(perhaps a bad move? I still have these in /mnt/pg_xlog)
# /etc/init.d/postgresql-8.3 restart
Same problem. WTF?
# rm /var/lib/postgresql/8.3/main/pg_xlog/archive_status/00000002.history.ready
# /etc/init.d/postgresql-8.3 restart
# ps ax | grep postgres
3499 ? Ss 0:00 postgres: startup process waiting for
000000020000002F000000AA
3503 ? S 0:00 /bin/bash
/var/lib/postgresql/bin/recover.sh pg_xlog/RECOVERYXLOG
000000020000002F000000AA
Why is it looking for 000000020000002F000000AA? It should be looking
for 000000010000002F000000AA.
This seems to be a history problem with timelines. Recovery branched
into a new timeline, which is where it is now. I need to force it back
to the previous timeline so it can resume recovery. I'm building a
clean recovery database now but I'd like to get to the bottom of this
and bring the failed recovery database back into recovery mode so I
can learn how to get out of this situation more gracefully.
Thanks for any help you can offer.
_____________
Lee Azzarello
Staff Hacker
drop.io
From | Date | Subject | |
---|---|---|---|
Next Message | Tom Lane | 2009-01-30 01:06:41 | Re: [GENERAL] Encoding problem using pg_dumpall |
Previous Message | Michael Monnerie | 2009-01-29 19:14:28 | Re: [GENERAL] Encoding problem using pg_dumpall |