unable to fail over to warm standby server

From: Mason Hale <mason(at)onespot(dot)com>
To: pgsql-bugs(at)postgresql(dot)org
Cc: Simon Riggs <simon(at)2ndquadrant(dot)com>
Subject: unable to fail over to warm standby server
Date: 2010-01-27 23:23:12
Message-ID: 1e85dd391001271523i2348d739y998bdd814b6a50c0@mail.gmail.com
Lists: pgsql-bugs

Hello --

We are using PostgreSQL 8.3.8 with a Warm Standby (PITR) setup.

Recently we experienced a failure on our primary database server, and
when we attempted to fail over to the standby server it would not come
up. This configuration had been tested previously (we have successfully
failed over from primary to secondary at least three times in the
past).

Our system is now back up, with the Warm Standby setup reconfigured
and running again.

I'm reporting this issue as a bug to describe the circumstances of
this failed recovery, in the event that it does indicate a bug in this
very critical feature of PostgreSQL.

Both servers are identically configured as follows:
- Dual QuadCore Intel Xeon-Harpertown 5430 @ 2.66GHz
- Redhat EL 5.1-64
- 64GB RAM
- 250GB - 2-disk RAID 0 for / partition
- 1.1TB - 8 disk RAID 10 configuration for /data
- 72GB - 2-disk RAID 0 for /pg_xlog

It is also probably worth mentioning that Simon Riggs (copied)
consulted on the configuration of our warm standby setup, so we are
fairly confident it was set up correctly. ;-)

Archive configuration on primary server (prod-db-1) was:

archive_mode = on
archive_command = 'rsync -az --rsh=ssh %p prod-db-2.vpn:wal_archive/%f </dev/null'
archive_timeout = 120
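
For reference, PostgreSQL substitutes %p with the path of the segment
to archive (relative to the data directory) and %f with just the file
name, so for one of the segments mentioned below the expanded command
would look roughly like this (an illustrative expansion, not a
captured command line):

rsync -az --rsh=ssh pg_xlog/0000000200003C82000000D7 prod-db-2.vpn:wal_archive/0000000200003C82000000D7 </dev/null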

Restore configuration (in recovery.conf) on secondary server (prod-db-2) was:

restore_command = 'pg_standby -t /tmp/pgsql.trigger.5432 /home/postgres/wal_archive %f %p %r 2>>pg_standby.log'
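
For clarity, the arguments pg_standby receives here are: the trigger
file to watch for (-t /tmp/pgsql.trigger.5432), the archive directory
(/home/postgres/wal_archive), %f (the segment the server is asking
for), %p (the path the server wants it copied to, typically something
like pg_xlog/RECOVERYXLOG), and %r (the oldest segment still needed,
which pg_standby uses to clean out the archive directory). An expanded
invocation would look roughly like this (illustrative only; the %p and
%r values are assumed):

pg_standby -t /tmp/pgsql.trigger.5432 /home/postgres/wal_archive 0000000200003C82000000D8 pg_xlog/RECOVERYXLOG 0000000200003C82000000D4 2>>pg_standby.log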

Timeline of events:

1.) At 20:58 UTC -- we received an alert that the primary server was
not responding to pings.

FYI the last lines in the postgresql.log file on the primary server
(before it crashed) were:

2010-01-18 20:40:56 UTC (10.16.74.194)LOG: unexpected EOF on client connection
2010-01-18 20:40:59 UTC ()LOG: checkpoint starting: time
2010-01-18 20:45:56 UTC (10.16.74.200)LOG: unexpected EOF on client connection
2010-01-18 20:50:57 UTC (10.12.58.142)LOG: unexpected EOF on client connection
2010-01-18 20:51:45 UTC ()LOG: checkpoint complete: wrote 45066
buffers (8.6%); 0 transaction log file(s) added, 0 removed, 44
recycled; write=455.622 s, sync=190.549 s, total=646.443 s

- A sysadmin confirmed the primary was offline and started the manual
process of transitioning to the standby server, while another sysadmin
worked simultaneously on trying to bring the primary server back
online.

2.) The output of `ps aufx` on the standby (prod-db-2) included:

postgres 16323 0.0 0.1 4578680 98180 ? S 2009 0:00
/usr/local/pgsql-8.3.8/bin/postgres -D /data/pgdata-8.3
postgres 16324 0.0 0.0 71884 716 ? Ss 2009 0:09 \_
postgres: logger process
postgres 16325 3.4 6.5 4581480 4284404 ? Ss 2009 3902:25 \_
postgres: startup process waiting for 0000000200003C82000000D8
postgres 9164 0.0 0.0 63772 996 ? S 20:56 0:00
\_ sh -c pg_standby -t /tmp/pgsql.trigger.5432
/home/postgres/wal_archive 0000000200003C82000000D
postgres 9165 0.0 0.0 19168 616 ? S 20:56 0:00
\_ pg_standby -t /tmp/pgsql.trigger.5432 /home/postgres/wal_archive
0000000200003C82000000D8

Notice the standby server is waiting on WAL file: 0000000200003C82000000D8

3.) Running `ls -l /home/postgres/wal_archive` showed that the most
recently transferred files were:

-rw------- 1 postgres postgres 16777216 Jan 18 20:55 0000000200003C82000000D5
-rw------- 1 postgres postgres 16777216 Jan 18 20:55 0000000200003C82000000D6
-rw------- 1 postgres postgres 16777216 Jan 18 20:56 0000000200003C82000000D7

4.) The sysadmin ran the following command to make the secondary
server exit recovery mode and come online:

sudo touch /tmp/pgsql.trigger.5432

NOTE: The use of 'sudo' here is a potential cause of, or contributor
to, the failure, because the file pgsql.trigger.5432 will be owned by
root, yet the postmaster process runs as user 'postgres'. However, a
permission issue seems inconsistent with what we saw next.
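
In hindsight, creating the trigger file as the postgres user would
have avoided the ownership mismatch, e.g.:

sudo -u postgres touch /tmp/pgsql.trigger.5432

(or chown'ing the file to postgres right after creating it).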

5.) After a few seconds the postgres process stopped but did not come
back up. The last few lines of the postgresql.log file were:

2010-01-18 20:54:28 UTC ()LOG: restored log file
"0000000200003C82000000D4" from archive
2010-01-18 20:55:03 UTC ()LOG: restored log file
"0000000200003C82000000D5" from archive
2010-01-18 20:55:37 UTC ()LOG: restored log file
"0000000200003C82000000D6" from archive
2010-01-18 20:56:20 UTC ()LOG: restored log file
"0000000200003C82000000D7" from archive
2010-01-18 21:08:31 UTC ()FATAL: could not restore file
"0000000200003C82000000D8" from archive: return code 65280
2010-01-18 21:08:31 UTC ()LOG: startup process (PID 16325) exited
with exit code 1
2010-01-18 21:08:31 UTC ()LOG: aborting startup due to startup process failure

Looking in the /home/postgres/wal_archive directory again confirmed
that it did not contain any file named 0000000200003C82000000D8.

This was the most mysterious part of the puzzle: the database
apparently aborted with a FATAL error while trying to restore a file
that didn't exist.

We've searched Google and postgresql.org to determine what return code
65280 means, but have come up empty.
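
One possible reading of that number (assuming it is the raw wait
status from system(), with the exit code stored in the high byte):

echo $((65280 >> 8))    # 255 -> the restore command exited with status 255
echo $((65280 & 255))   # 0   -> it was not killed by a signal

i.e. pg_standby (or the shell wrapping it) exited with status 255.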

6.) The sysadmin attempted to restart the server by hand using `pg_ctl
-D /data/pgdata-8.3/ start` -- this command printed a "server starting"
message, but the server never appeared in the process list.
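
(For completeness, and not something we ran at the time: passing -w
makes pg_ctl wait for startup to finish and report a failure instead
of just printing "server starting":

pg_ctl -D /data/pgdata-8.3/ -w start
)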

7.) The contents of the newly created postgresql.log were:

tail postgresql-2010-01-18_210946.log
2010-01-18 21:09:46 UTC ()LOG: database system was interrupted while
in recovery at log time 2010-01-18 20:10:59 UTC
2010-01-18 21:09:46 UTC ()HINT: If this has occurred more than once
some data might be corrupted and you might need to choose an earlier
recovery target.
2010-01-18 21:09:46 UTC ()LOG: starting archive recovery
2010-01-18 21:09:46 UTC ()LOG: restore_command = 'pg_standby -t
/tmp/pgsql.trigger.5432 /home/postgres/wal_archive %f %p %r
2>>pg_standby.log'
2010-01-18 21:09:46 UTC ()FATAL: could not restore file
"00000002.history" from archive: return code 65280
2010-01-18 21:09:46 UTC ()LOG: startup process (PID 9284) exited with
exit code 1
2010-01-18 21:09:46 UTC ()LOG: aborting startup due to startup process failure

8.) The sysadmin noticed the 'recovery.conf' file was still in the
database root directory (it had not been renamed to recovery.done).
The recovery.conf file was renamed to recovery.conf.old and `pg_ctl -D
/data/pgdata-8.3/ start` was attempted again.
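
Roughly, the commands run at this step were (reconstructed, not a
captured transcript):

mv /data/pgdata-8.3/recovery.conf /data/pgdata-8.3/recovery.conf.old
pg_ctl -D /data/pgdata-8.3/ start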

9.) The server did not come up (again). This time the contents of the
new postgresql.log file were:

[postgres(at)prod-db-2 pg_log]$ tail -n 100 postgresql-2010-01-18_211132.log
2010-01-18 21:11:32 UTC ()LOG: database system was interrupted while
in recovery at log time 2010-01-18 20:10:59 UTC
2010-01-18 21:11:32 UTC ()HINT: If this has occurred more than once
some data might be corrupted and you might need to choose an earlier
recovery target.
2010-01-18 21:11:32 UTC ()LOG: could not open file
"pg_xlog/0000000200003C82000000A3" (log file 15490, segment 163): No
such file or directory
2010-01-18 21:11:32 UTC ()LOG: invalid primary checkpoint record
2010-01-18 21:11:32 UTC ()LOG: could not open file
"pg_xlog/0000000200003C8200000049" (log file 15490, segment 73): No
such file or directory
2010-01-18 21:11:32 UTC ()LOG: invalid secondary checkpoint record
2010-01-18 21:11:32 UTC ()PANIC: could not locate a valid checkpoint record
2010-01-18 21:11:32 UTC ()LOG: startup process (PID 9328) was
terminated by signal 6: Aborted
2010-01-18 21:11:32 UTC ()LOG: aborting startup due to startup process failure

10.) At this point a lot of time was spent on the #postgresql IRC
channel looking for help diagnosing this problem. Before the issue was
resolved, the primary server was brought back online. Soon after, we
decided to create a new base backup on the standby server and
abandoned the attempt to recover.

FYI: The last few lines of pg_standby.log were:

[postgres(at)prod-db-2 pgdata-8.3]$ tail -n 300 pg_standby.log
trigger file found
cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such
file or directory
cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such
file or directory
cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such
file or directory
cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such
file or directory
cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such
file or directory
cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such
file or directory
cp: cannot stat `/home/postgres/wal_archive/00000002.history': No such
file or directory
trigger file found

ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not
permittedtrigger file found

ERROR: could not remove "/tmp/pgsql.trigger.5432": Operation not permitted

This file was not looked at until after the recovery attempt was
abandoned. Clearly the permissions on /tmp/pgsql.trigger.5432 were a
problem, but we don't see how that would explain the error messages,
which seem to indicate that data on the standby server was corrupted.
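
Regarding the "Operation not permitted" errors: /tmp normally has the
sticky bit set, so a file there can only be unlinked by its owner (in
this case root) or by root itself, which would explain why pg_standby,
running as postgres, could not remove the trigger file. Assuming the
usual permissions on /tmp:

ls -ld /tmp
# drwxrwxrwt ... /tmp   (the trailing 't' is the sticky bit)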

If you need additional information, we have saved console logs and log
files from both the primary and secondary servers.

Thanks for reading this far, and for any help in tracking down the
cause of this unexpected failure.

cheers,

- Mason
