Warm standby stall -- what debug info would help?

From: "Kevin Grittner" <Kevin(dot)Grittner(at)wicourts(dot)gov>
To: <pgsql-hackers(at)postgresql(dot)org>
Subject: Warm standby stall -- what debug info would help?
Date: 2007-07-09 15:54:51
Message-ID: 469213FB.EE98.0025.0@wicourts.gov
Lists: pgsql-hackers

We're running a number of warm standby instances on one server. One of them stalled on Saturday. When I found it this morning, I confirmed that the WAL files it needed were present in the directory from which it should be pulling them. The logs showed normal processing up to the stall, with no messages afterwards. I tried a restart; it resumed warm standby status and caught up quickly.

This looks like a PostgreSQL bug of some sort, although perhaps someone can spot a problem in our recovery script. If it happens again, what information should I gather before restarting to help find the cause?
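In case it helps frame the question, here is the sort of snapshot I could capture before the restart. This is only a sketch: the paths are this server's, but the output file name and the idea of grepping the process table for the startup process and our script are my own assumptions, not an established procedure.

```shell
#!/bin/bash
# Hypothetical pre-restart diagnostic snapshot for a stalled standby.
# Collect everything into one file so it can be attached to a report.
out=stall-debug.txt
{
  echo "== processes =="
  # Is the startup process alive? Is recovery.sh sleeping in its loop?
  # (Bracketed first letters keep grep from matching itself.)
  ps -ef | grep -E '[p]ostgres|[r]ecovery.sh'
  echo "== archive directory =="
  ls -l /var/pgsql/data/county/ozaukee/wal-files/ 2>&1 | tail -n 20
  echo "== pg_controldata =="
  pg_controldata /var/pgsql/data/county/ozaukee/data/ 2>&1
  # If available, attaching strace or gdb to the startup process PID
  # before restarting might also show what it is blocked on.
} > "$out"
echo "wrote $out"
```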

-Kevin

PGBACKUP:/var/pgsql/data/county/ozaukee/data/pg_log # tail postgresql-2007-07-07_000000.log
[2007-07-07 09:44:41.392 CDT] 5962 LOG: restored log file "000000010000000C000000D2" from archive
[2007-07-07 12:24:53.597 CDT] 5962 LOG: restored log file "000000010000000C000000D3" from archive
[2007-07-07 12:24:53.984 CDT] 5962 LOG: restored log file "000000010000000C000000D4" from archive
[2007-07-07 12:24:54.351 CDT] 5962 LOG: restored log file "000000010000000C000000D5" from archive
[2007-07-07 14:10:42.208 CDT] 5962 LOG: restored log file "000000010000000C000000D6" from archive
[2007-07-07 14:10:42.634 CDT] 5962 LOG: restored log file "000000010000000C000000D7" from archive
[2007-07-07 15:23:41.717 CDT] 5962 LOG: restored log file "000000010000000C000000D8" from archive
[2007-07-07 18:24:26.933 CDT] 5962 LOG: restored log file "000000010000000C000000D9" from archive
[2007-07-07 18:24:27.692 CDT] 5962 LOG: restored log file "000000010000000C000000DA" from archive
[2007-07-07 18:24:28.051 CDT] 5962 LOG: restored log file "000000010000000C000000DB" from archive
PGBACKUP:/var/pgsql/data/county/ozaukee/data/pg_log # cat postgresql-2007-07-09_000000.log
[2007-07-09 08:21:50.200 CDT] 5904 LOG: received fast shutdown request
[2007-07-09 08:21:50.201 CDT] 5962 FATAL: could not restore file "000000010000000C000000DC" from archive: return code 15
[2007-07-09 08:21:50.485 CDT] 5904 LOG: startup process (PID 5962) exited with exit code 1
[2007-07-09 08:21:50.485 CDT] 5904 LOG: aborting startup due to startup process failure
[2007-07-09 08:21:50.555 CDT] 5960 LOG: logger shutting down
PGBACKUP:/var/pgsql/data/county/ozaukee/data/pg_log # cat postgresql-2007-07-09_082151.log | grep -v 'starting up'
[2007-07-09 08:21:51.718 CDT] 19076 LOG: database system was interrupted while in recovery at log time 2007-07-07 15:00:02 CDT
[2007-07-09 08:21:51.718 CDT] 19076 HINT: If this has occurred more than once some data may be corrupted and you may need to choose an earlier recovery target.
[2007-07-09 08:21:51.718 CDT] 19076 LOG: starting archive recovery
[2007-07-09 08:21:51.725 CDT] 19076 LOG: restore_command = "/usr/local/backup/recovery.sh %f %p"
[2007-07-09 08:21:52.079 CDT] 19076 LOG: restored log file "000000010000000C000000D9" from archive
[2007-07-09 08:21:52.079 CDT] 19076 LOG: checkpoint record is at C/D9000020
[2007-07-09 08:21:52.079 CDT] 19076 LOG: redo record is at C/D9000020; undo record is at 0/0; shutdown FALSE
[2007-07-09 08:21:52.079 CDT] 19076 LOG: next transaction ID: 0/10700115; next OID: 1387338
[2007-07-09 08:21:52.079 CDT] 19076 LOG: next MultiXactId: 1; next MultiXactOffset: 0
[2007-07-09 08:21:52.079 CDT] 19076 LOG: automatic recovery in progress
[2007-07-09 08:21:52.081 CDT] 19076 LOG: redo starts at C/D9000068
[2007-07-09 08:21:52.429 CDT] 19076 LOG: restored log file "000000010000000C000000DA" from archive
[2007-07-09 08:21:52.860 CDT] 19076 LOG: restored log file "000000010000000C000000DB" from archive

PGBACKUP:/var/pgsql/data/county/ozaukee/data # grep -Ev '^([[:space:]]+)?($|#)' postgresql.conf
listen_addresses = '*' # what IP address(es) to listen on;
port = 5445 # (change requires restart)
max_connections = 50 # (change requires restart)
shared_buffers = 160MB # min 128kB or max_connections*16kB
temp_buffers = 10MB # min 800kB
work_mem = 30MB # min 64kB
maintenance_work_mem = 160MB # min 1MB
max_fsm_pages = 204800 # min max_fsm_relations*16, 6 bytes each
bgwriter_lru_percent = 20.0 # 0-100% of LRU buffers scanned/round
bgwriter_lru_maxpages = 200 # 0-1000 buffers max written/round
bgwriter_all_percent = 10.0 # 0-100% of all buffers scanned/round
bgwriter_all_maxpages = 600 # 0-1000 buffers max written/round
wal_buffers = 1MB # min 32kB
checkpoint_segments = 10 # in logfile segments, min 1, 16MB each
checkpoint_timeout = 30min # range 30s-1h
archive_timeout = 3600 # force a logfile segment switch after this
seq_page_cost = 0.1 # measured on an arbitrary scale
random_page_cost = 0.1 # same scale as above
effective_cache_size = 3GB
redirect_stderr = on # Enable capturing of stderr into log
log_line_prefix = '[%m] %p %q<%u %d %r> ' # Special values:
stats_block_level = on
stats_row_level = on
autovacuum = on # enable autovacuum subprocess?
autovacuum_naptime = 10s # time between autovacuum runs
autovacuum_vacuum_threshold = 1 # min # of tuple updates before
autovacuum_analyze_threshold = 1 # min # of tuple updates before
datestyle = 'iso, mdy'
lc_messages = 'C' # locale for system error message
lc_monetary = 'C' # locale for monetary formatting
lc_numeric = 'C' # locale for number formatting
lc_time = 'C' # locale for time formatting
escape_string_warning = off
standard_conforming_strings = on
sql_inheritance = off

PGBACKUP:/var/pgsql/data/county/ozaukee/data # pg_controldata /var/pgsql/data/county/ozaukee/data/
pg_control version number: 822
Catalog version number: 200611241
Database system identifier: 5062210045799347511
Database cluster state: in archive recovery
pg_control last modified: Mon 09 Jul 2007 08:21:52 AM CDT
Current log file ID: 8
Next log file segment: 150
Latest checkpoint location: C/D9000020
Prior checkpoint location: C/D9000020
Latest checkpoint's REDO location: C/D9000020
Latest checkpoint's UNDO location: C/D9000020
Latest checkpoint's TimeLineID: 1
Latest checkpoint's NextXID: 0/10700115
Latest checkpoint's NextOID: 1387338
Latest checkpoint's NextMultiXactId: 1
Latest checkpoint's NextMultiOffset: 0
Time of latest checkpoint: Sat 07 Jul 2007 03:00:02 PM CDT
Minimum recovery ending location: 8/9500A530
Maximum data alignment: 4
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Date/time type storage: 64-bit integers
Maximum length of locale name: 128
LC_COLLATE: C
LC_CTYPE: C

PGBACKUP:/var/pgsql/data/county/ozaukee/data # cat /usr/local/backup/recovery.sh
#! /bin/bash
# Pick out the county name from the back of the path.
# $PWD will be /var/pgsql/data/county/<countyName>/data
countyName=$(basename "$(dirname "$PWD")")
walDir=/var/pgsql/data/county/$countyName/wal-files

# Wait while the requested segment has not arrived and the primary is not
# done, or while an rsync of the archive is still in progress.
# (-a binds tighter than -o: (no file AND no DONE) OR rsync-in-progress.)
while [ ! -f "$walDir/$1.gz" \
        -a ! -f "/var/pgsql/data/county/$countyName/DONE" \
        -o -f "$walDir/rsync-in-progress" ]
do
    # Requests for anything other than a plain WAL segment (e.g. history
    # or backup files) are expected to fail immediately.
    if ! echo "$1" | grep -Eq '^[0-9A-F]{24}$' ; then
        exit 1
    fi
    sleep 10
done
gunzip < "$walDir/$1.gz" > "$2"
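For anyone checking the wait condition's logic, here is a minimal, self-contained sketch of it, using temp files to stand in for the WAL segment, the DONE marker, and the rsync flag (the helper function and file names are illustrative, not part of the real script):

```shell
#!/bin/bash
# Reproduce the script's while-condition in isolation to verify how
# test's -a/-o operators group.
d=$(mktemp -d)
should_wait() {  # args: wal_present done_present rsync_present (each 0 or 1)
  [ "$1" = 1 ] && touch "$d/seg.gz" || rm -f "$d/seg.gz"
  [ "$2" = 1 ] && touch "$d/DONE" || rm -f "$d/DONE"
  [ "$3" = 1 ] && touch "$d/rsync-in-progress" || rm -f "$d/rsync-in-progress"
  # -a binds tighter than -o: (no segment AND no DONE) OR rsync flag set
  [ ! -f "$d/seg.gz" -a ! -f "$d/DONE" -o -f "$d/rsync-in-progress" ]
}
should_wait 0 0 0 && echo "waits when nothing is present"
should_wait 1 0 0 || echo "proceeds once the segment arrives"
should_wait 1 0 1 && echo "still waits while rsync is in progress"
```

So even after the segment arrives, the loop keeps waiting whenever the rsync-in-progress flag exists; a flag file left behind by a failed rsync would stall the standby exactly as described above.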
