BUG #7753: Cannot promote out of hot standby

From: daniel(at)heroku(dot)com
To: pgsql-bugs(at)postgresql(dot)org
Subject: BUG #7753: Cannot promote out of hot standby
Date: 2012-12-12 21:36:23
Message-ID: E1TityZ-0000Oz-6z@wrigleys.postgresql.org
Lists: pgsql-bugs

The following bug has been logged on the website:

Bug reference: 7753
Logged by: Daniel Farina
Email address: daniel(at)heroku(dot)com
PostgreSQL version: 9.1.6
Operating system: Ubuntu 10.04
Description:

Touching a trigger file will not cause promotion out of hot standby.
Basically, an apparently normally-working hot-standby database will
not leave hot standby. The database emitting WAL is version 9.1.4.

Everything appears normal in the log (downloads and restoring of
archived segments), and the server seems to take no notice of the
trigger file.
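For reference, recovery.conf on the standby is the usual wal-e setup, roughly like this (the restore_command is abbreviated and the envdir path is approximate; the trigger_file path is the one that shows up in the log below):

    # exact wal-e invocation may differ slightly
    restore_command = 'envdir /etc/postgresql/wal-e.d/pull-env wal-e wal-fetch "%f" "%p"'
    trigger_file = '/etc/postgresql/wal-e.d/pull-env/STANDBY_OFF'

Promotion is requested by touching that file:

    touch /etc/postgresql/wal-e.d/pull-env/STANDBY_OFF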

To force the issue, I introduced an error into the configuration of
the restoration program to cause it to exit. Normally that's no
problem; postgres would just keep on trying to restore a segment over
and over until the error is fixed.
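(In this case the error was simply making the AWS credentials unavailable to wal-e, e.g. something along the lines of

    # illustrative path; the credential lives in the envdir read by restore_command
    mv /etc/postgresql/wal-e.d/pull-env/AWS_SECRET_ACCESS_KEY /tmp/

which accounts for the "no AWS_SECRET_ACCESS_KEY defined" errors below.)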

Instead, the server crashes:

[413-1] [COPPER] LOG: restored log file "000000010000034D00000050" from
archive
wal_e.worker.s3_worker INFO MSG: completed download and
decompression#012 DETAIL: Downloaded and decompressed
"s3://archive-root/wal_005/000000010000034D00000051.lzo" to
"pg_xlog/RECOVERYXLOG"
[414-1] [COPPER] LOG: restored log file "000000010000034D00000051" from
archive
wal_e.worker.s3_worker INFO MSG: completed download and
decompression#012 DETAIL: Downloaded and decompressed
"s3://archive-root/wal_005/000000010000034D00000052.lzo" to
"pg_xlog/RECOVERYXLOG"
[415-1] [COPPER] LOG: restored log file "000000010000034D00000052" from
archive

# I introduce the failure here

wal_e.main ERROR MSG: no AWS_SECRET_ACCESS_KEY defined#012 HINT:
Define the environment variable AWS_SECRET_ACCESS_KEY.
LOG: trigger file found: /etc/postgresql/wal-e.d/pull-env/STANDBY_OFF
LOG: redo done at 34D/52248590
LOG: last completed transaction was at log time 2012-12-10
wal_e.main ERROR MSG: no AWS_SECRET_ACCESS_KEY defined#012 HINT: Define the
environment variable AWS_SECRET_ACCESS_KEY.
PANIC: could not open file "pg_xlog/000000010000034D00000052" (log file
845, segment 82): No such file or directory
LOG: startup process (PID 7) was terminated by signal 6: Aborted
LOG: terminating any other active server processes
WARNING: terminating connection because of crash of another server process
WARNING: terminating connection because of crash of another server process
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.
DETAIL: The postmaster has commanded this server process to roll back the
current transaction and exit, because another server process exited
abnormally and possibly corrupted shared memory.

I can fix the configuration and restart the server, and everything is as
fine as before. Next, I try removing recovery.conf and restarting the
server as an alternative way of promoting, but to no avail; it fails with
a slightly different error message:

# Server begins starting
LOG: loaded library "auto_explain"
LOG: loaded library "pg_stat_statements"
LOG: database system was interrupted while in recovery at log time
2012-12-10 15:20:03 UTC
HINT: If this has occurred more than once some data might be corrupted and
you might need to choose an earlier recovery target.
LOG: could not open file "pg_xlog/000000010000034E0000001A" (log file 846,
segment 26): No such file or directory
LOG: invalid primary checkpoint record
LOG: could not open file "pg_xlog/000000010000034D000000F2" (log file 845,
segment 242): No such file or directory
LOG: invalid secondary checkpoint record
PANIC: could not locate a valid checkpoint record
LOG: startup process (PID 7) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure
main process (24284) terminated with status 1

The pg_controldata output looks like this around the same time, for reference:

pg_control version number: 903
Catalog version number: 201105231
Database cluster state: in archive recovery
pg_control last modified: Wed 12 Dec 2012 09:22:30 PM UTC
Latest checkpoint location: 351/1FE194C0
Prior checkpoint location: 351/FD64A78
Latest checkpoint's REDO location: 351/131848C8
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 0/652342033
Latest checkpoint's NextOID: 103224
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Latest checkpoint's oldestXID: 455900714
Latest checkpoint's oldestXID's DB: 16385
Latest checkpoint's oldestActiveXID: 652311442
Time of latest checkpoint: Mon 10 Dec 2012 07:19:23 PM UTC
Minimum recovery ending location: 351/4BFFFE20
Backup start location: 0/0
Current wal_level setting: hot_standby
Current max_connections setting: 500
Current max_prepared_xacts setting: 500
Current max_locks_per_xact setting: 64
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Date/time type storage: 64-bit integers
Float4 argument passing: by value
Float8 argument passing: by value

In the course of all this messing around, it has never been a problem to go
back to archive recovery.
