WAL archiving file name collision

From: Berge Schwebs Bjørlo <berge(at)trivini(dot)no>
To: pgsql-general(at)postgresql(dot)org
Subject: WAL archiving file name collision
Date: 2009-09-09 10:17:33
Message-ID: 20090909101732.GD13172@trivini.no
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-general

Greetings!

We've got a moderately busy DB host running PostgreSQL 8.3.7 from Debian Lenny
with WAL archiving. It's been rsyncing its WAL files to a backup host for the
last year or so, with the following archive_command:

ssh pgbackup(at)backup test ! -f pgbackup-cirkus/%f && rsync -az %p pgbackup(at)backup:pgbackup-cirkus/%f

Recently, WAL archiving begain failing on the test which checks wether the
file exists. This first occured two hours after an incident where someone
edited pg_hba.conf and left it with permissions denying Postgres read access
to it. Upon SIGHUP the cluster naturally shut down. It was discovered
promptly, and according to this person, there were "some processes named
postgres still running". He ran "/etc/init.d/postgresql-8.3 start" anyway,
which brought up the cluster:

2009-09-07 20:39:55 CEST 5782 LOG: received SIGHUP, reloading configuration files
2009-09-07 20:39:55 CEST 5782 FATAL: could not open configuration file "/etc/postgresql/8.3/main/pg_hba.conf": Permission denied
2009-09-07 20:40:07 CEST 14398 mdb2 mdb2web LOG: could not receive data from client: Connection reset by peer
2009-09-07 20:40:07 CEST 14398 mdb2 mdb2web LOG: unexpected EOF on client connection
2009-09-07 20:40:15 CEST 14485 mdb2 billig_web LOG: could not receive data from client: Connection reset by peer
2009-09-07 20:40:15 CEST 14485 mdb2 billig_web LOG: unexpected EOF on client connection
2009-09-07 20:41:29 CEST 16197 LOG: could not load root certificate file "root.crt": no SSL error reported
2009-09-07 20:41:29 CEST 16197 DETAIL: Will not verify client certificates.
2009-09-07 20:41:29 CEST 16197 FATAL: could not open configuration file "/etc/postgresql/8.3/main/pg_hba.conf": Permission denied
2009-09-07 20:42:04 CEST 16748 LOG: could not load root certificate file "root.crt": no SSL error reported
2009-09-07 20:42:04 CEST 16748 DETAIL: Will not verify client certificates.
2009-09-07 20:42:04 CEST 16749 LOG: database system was interrupted; last known up at 2009-09-07 20:37:38 CEST
2009-09-07 20:42:04 CEST 16749 LOG: database system was not properly shut down; automatic recovery in progress
2009-09-07 20:42:04 CEST 16749 LOG: redo starts at 65/F00718D0
2009-09-07 20:42:04 CEST 16749 LOG: record with zero length at 65/F1039488
2009-09-07 20:42:04 CEST 16749 LOG: redo done at 65/F1039458
2009-09-07 20:42:04 CEST 16749 LOG: last completed transaction was at log time 2009-09-07 20:39:52.010594+02
2009-09-07 20:42:04 CEST 16749 LOG: checkpoint starting: shutdown immediate
2009-09-07 20:42:04 CEST 16750 [unknown] [unknown] LOG: connection received: host=[local]
2009-09-07 20:42:04 CEST 16750 [unknown] [unknown] LOG: incomplete startup packet
2009-09-07 20:42:04 CEST 16749 LOG: checkpoint complete: wrote 43 buffers (0.2%); 0 transaction log file(s) added, 0 removed, 6 recycled; write=0.001 s, sync=0.429 s, total=0.464 s
2009-09-07 20:42:04 CEST 16756 [unknown] [unknown] LOG: connection received: host=2001:700:300:1800::1917 port=50140
2009-09-07 20:42:04 CEST 16749 LOG: recovering prepared transaction 809084
2009-09-07 20:42:04 CEST 16756 mdb2 uka_web FATAL: the database system is starting up
2009-09-07 20:42:04 CEST 16757 [unknown] [unknown] LOG: connection received: host=2001:700:300:1800::1917 port=50141
2009-09-07 20:42:04 CEST 16757 mdb2 uka_web FATAL: the database system is starting up
2009-09-07 20:42:05 CEST 16762 [unknown] [unknown] LOG: connection received: host=2001:700:300:1800::1919 port=36523
2009-09-07 20:42:05 CEST 16762 greylisting exim FATAL: the database system is starting up
2009-09-07 20:42:05 CEST 16763 [unknown] [unknown] LOG: connection received: host=2001:700:300:1800::1919 port=36524
2009-09-07 20:42:05 CEST 16763 greylisting exim FATAL: the database system is starting up
2009-09-07 20:42:05 CEST 16764 [unknown] [unknown] LOG: connection received: host=2001:700:300:1800::1919 port=36525
2009-09-07 20:42:05 CEST 16764 greylisting exim FATAL: the database system is starting up
2009-09-07 20:42:05 CEST 16765 [unknown] [unknown] LOG: connection received: host=2001:700:300:1800::1919 port=36526
2009-09-07 20:42:05 CEST 16765 greylisting exim FATAL: the database system is starting up
2009-09-07 20:42:05 CEST 16768 [unknown] [unknown] LOG: connection received: host=[local]
2009-09-07 20:42:05 CEST 16768 postgres postgres FATAL: the database system is starting up
2009-09-07 20:42:05 CEST 16776 LOG: autovacuum launcher started
2009-09-07 20:42:05 CEST 16748 LOG: database system is ready to accept connections
2009-09-07 20:42:05 CEST 16786 [unknown] [unknown] LOG: connection received: host=[local]
2009-09-07 20:42:05 CEST 16786 postgres postgres LOG: connection authorized: user=postgres database=postgres

The cluster recovered, which strikes me as odd - shouldn't Postgres at least
rollback uncommited transactions and shut down cleanly upon an unreadable
configuration file?

Anyway, the cluster ran fine, handling several transactions per second, until
one and a half hour later, where WAL archiving suddenly stopped. (The cluster
runs on happily anyway, though.)

2009-09-07 22:12:53 CEST 2691 mdb2_ukadev uka_mdb LOG: statement: ABORT
2009-09-07 22:12:54 CEST 16774 LOG: checkpoint complete: wrote 279 buffers (1.4%); 0 transaction log file(s) added, 0 removed, 6 recycled; write=48.665 s, sync=0.631 s, total=49.339 s
2009-09-07 22:12:54 CEST 16777 LOG: archive command failed with exit code 1
2009-09-07 22:12:54 CEST 16777 DETAIL: The failed archive command was: ssh pgbackup(at)backup test ! -f pgbackup-cirkus/0000000100000065000000F1 && rsync -az pg_xlog/0000000100000065000000F1 pgbackup(at)backup:pgbackup-cirkus/0000000100000065000000F1

[..]

2009-09-07 22:12:55 CEST 16777 LOG: archive command failed with exit code 1
2009-09-07 22:12:55 CEST 16777 DETAIL: The failed archive command was: ssh pgbackup(at)backup test ! -f pgbackup-cirkus/0000000100000065000000F1 && rsync -az pg_xlog/0000000100000065000000F1 pgbackup(at)backup:pgbackup-cirkus/0000000100000065000000F1

[..]

2009-09-07 22:12:56 CEST 16777 LOG: archive command failed with exit code 1
2009-09-07 22:12:56 CEST 16777 DETAIL: The failed archive command was: ssh pgbackup(at)backup test ! -f pgbackup-cirkus/0000000100000065000000F1 && rsync -az pg_xlog/0000000100000065000000F1 pgbackup(at)backup:pgbackup-cirkus/0000000100000065000000F1
2009-09-07 22:12:56 CEST 16777 WARNING: transaction log file "0000000100000065000000F1" could not be archived: too many failures

On the backup host (receiving the WAL files), named kolje:

pgbackup(at)kolje:~$ ls -la pgbackup-cirkus/0000000100000065000000EE pgbackup-cirkus/0000000100000065000000EF pgbackup-cirkus/0000000100000065000000F0 pgbackup-cirkus/0000000100000065000000F1 pgbackup-cirkus/0000000100000065000000F2 pgbackup-cirkus/0000000100000065000000F4
-rw------- 1 pgbackup pgbackup 16777216 2009-09-07 20:27 pgbackup-cirkus/0000000100000065000000EE
-rw------- 1 pgbackup pgbackup 16777216 2009-09-07 20:32 pgbackup-cirkus/0000000100000065000000EF
-rw------- 1 pgbackup pgbackup 16777216 2009-09-07 20:37 pgbackup-cirkus/0000000100000065000000F0
-rw------- 1 pgbackup pgbackup 16777216 2009-09-07 20:47 pgbackup-cirkus/0000000100000065000000F1
-rw------- 1 pgbackup pgbackup 16777216 2009-09-07 20:52 pgbackup-cirkus/0000000100000065000000F2
-rw------- 1 pgbackup pgbackup 16777216 2009-09-07 21:02 pgbackup-cirkus/0000000100000065000000F4

On the DB host, called cirkus:

cirkus:~# ls -la /var/lib/postgresql/8.3/main/pg_xlog/0000000100000065000000F1 /var/lib/postgresql/8.3/main/pg_xlog/archive_status/0000000100000065000000F1.ready
-rw------- 1 postgres postgres 16777216 2009-09-07 22:03 /var/lib/postgresql/8.3/main/pg_xlog/0000000100000065000000F1
-rw------- 1 postgres postgres 0 2009-09-07 22:12 /var/lib/postgresql/8.3/main/pg_xlog/archive_status/0000000100000065000000F1.ready

There's no 0000000100000065000000F0 or 0000000100000065000000F2:

cirkus:~# ls -lart /var/lib/postgresql/8.3/main/pg_xlog/ | head
total 7365516
-rw------- 1 postgres postgres 16777216 2009-09-07 22:03 0000000100000065000000F1
-rw------- 1 postgres postgres 16777216 2009-09-07 22:17 000000010000006600000004
-rw------- 1 postgres postgres 16777216 2009-09-07 22:22 000000010000006600000005
-rw------- 1 postgres postgres 16777216 2009-09-07 22:27 000000010000006600000006
-rw------- 1 postgres postgres 16777216 2009-09-07 22:32 000000010000006600000007
-rw------- 1 postgres postgres 16777216 2009-09-07 22:37 000000010000006600000008
-rw------- 1 postgres postgres 16777216 2009-09-07 22:42 000000010000006600000009
-rw------- 1 postgres postgres 16777216 2009-09-07 22:47 00000001000000660000000A
-rw------- 1 postgres postgres 16777216 2009-09-07 22:52 00000001000000660000000B

The files in question do not share MD5 sum:

pgbackup(at)kolje:~$ md5sum pgbackup-cirkus/0000000100000065000000F1
341f308f3269176873f93b19496b2923 pgbackup-cirkus/0000000100000065000000F1
cirkus:~# md5sum /var/lib/postgresql/8.3/main/pg_xlog/0000000100000065000000F1
8010976a8a95480916ff2bae9203a2cf /var/lib/postgresql/8.3/main/pg_xlog/0000000100000065000000F1

How do I continue debugging this? How did a file name collision occur?

-B

Server configuration:

name | setting
---------------------------------+-----------------------------------------------------------------------------------------------------
add_missing_from | off
allow_system_table_mods | off
archive_command | ssh pgbackup(at)backup test ! -f pgbackup-cirkus/%f && rsync -az %p pgbackup(at)backup:pgbackup-cirkus/%f
archive_mode | on
archive_timeout | 300
array_nulls | on
authentication_timeout | 60
autovacuum | on
autovacuum_analyze_scale_factor | 0.3
autovacuum_analyze_threshold | 100
autovacuum_freeze_max_age | 200000000
autovacuum_max_workers | 3
autovacuum_naptime | 60
autovacuum_vacuum_cost_delay | -1
autovacuum_vacuum_cost_limit | -1
autovacuum_vacuum_scale_factor | 0.4
autovacuum_vacuum_threshold | 400
backslash_quote | safe_encoding
bgwriter_delay | 200
bgwriter_lru_maxpages | 100
bgwriter_lru_multiplier | 2
block_size | 8192
bonjour_name |
check_function_bodies | on
checkpoint_completion_target | 0.5
checkpoint_segments | 15
checkpoint_timeout | 1800
checkpoint_warning | 30
client_encoding | UTF8
client_min_messages | notice
commit_delay | 0
commit_siblings | 5
config_file | /etc/postgresql/8.3/main/postgresql.conf
constraint_exclusion | off
cpu_index_tuple_cost | 0.005
cpu_operator_cost | 0.0025
cpu_tuple_cost | 0.01
custom_variable_classes |
data_directory | /var/lib/postgresql/8.3/main
DateStyle | ISO, MDY
db_user_namespace | off
deadlock_timeout | 1000
debug_assertions | off
debug_pretty_print | on
debug_print_parse | off
debug_print_plan | on
debug_print_rewritten | off
default_statistics_target | 10
default_tablespace |
default_text_search_config | pg_catalog.simple
default_transaction_isolation | read committed
default_transaction_read_only | off
default_with_oids | off
dynamic_library_path | $libdir
effective_cache_size | 393216
enable_bitmapscan | on
enable_hashagg | on
enable_hashjoin | on
enable_indexscan | on
enable_mergejoin | on
enable_nestloop | on
enable_seqscan | on
enable_sort | on
enable_tidscan | on
escape_string_warning | on
explain_pretty_print | on
external_pid_file | /var/run/postgresql/8.3-main.pid
extra_float_digits | 0
from_collapse_limit | 8
fsync | on
full_page_writes | on
geqo | on
geqo_effort | 5
geqo_generations | 0
geqo_pool_size | 0
geqo_selection_bias | 2
geqo_threshold | 12
gin_fuzzy_search_limit | 0
hba_file | /etc/postgresql/8.3/main/pg_hba.conf
ident_file | /etc/postgresql/8.3/main/pg_ident.conf
ignore_system_indexes | off
integer_datetimes | on
join_collapse_limit | 8
krb_caseins_users | off
krb_realm |
krb_server_hostname |
krb_server_keyfile | /etc/sql.keytab
krb_srvname | postgres
lc_collate | en_US.UTF-8
lc_ctype | en_US.UTF-8
lc_messages |
lc_monetary | C
lc_numeric | C
lc_time | C
listen_addresses | *
local_preload_libraries |
log_autovacuum_min_duration | -1
log_checkpoints | on
log_connections | on
log_destination | stderr
log_directory | pg_log
log_disconnections | off
log_duration | off
log_error_verbosity | default
log_executor_stats | off
log_filename | postgresql-%Y-%m-%d_%H%M%S.log
log_hostname | off
log_line_prefix | %t %p %d %u
log_lock_waits | on
log_min_duration_statement | 100
log_min_error_statement | FATAL
log_min_messages | INFO
log_parser_stats | off
log_planner_stats | off
log_rotation_age | 1440
log_rotation_size | 10240
log_statement | ddl
log_statement_stats | off
log_temp_files | 1
log_timezone | localtime
log_truncate_on_rotation | off
logging_collector | off
maintenance_work_mem | 16384
max_connections | 256
max_files_per_process | 1000
max_fsm_pages | 2000000
max_fsm_relations | 35000
max_function_args | 100
max_identifier_length | 63
max_index_keys | 32
max_locks_per_transaction | 64
max_prepared_transactions | 100
max_stack_depth | 2048
password_encryption | on
port | 5432
post_auth_delay | 0
pre_auth_delay | 0
random_page_cost | 4
regex_flavor | advanced
search_path | "$user",public
seq_page_cost | 1
server_encoding | UTF8
server_version | 8.3.7
server_version_num | 80307
session_replication_role | origin
shared_buffers | 20000
shared_preload_libraries |
silent_mode | off
sql_inheritance | on
ssl | on
ssl_ciphers | ALL:!ADH:!LOW:!EXP:!MD5:@STRENGTH
standard_conforming_strings | off
statement_timeout | 0
superuser_reserved_connections | 3
synchronize_seqscans | on
synchronous_commit | on
syslog_facility | LOCAL0
syslog_ident | postgres
tcp_keepalives_count | 0
tcp_keepalives_idle | 0
tcp_keepalives_interval | 0
temp_buffers | 1024
temp_tablespaces |
TimeZone | localtime
timezone_abbreviations | Default
trace_notify | off
trace_sort | off
track_activities | on
track_counts | on
transaction_isolation | read committed
transaction_read_only | off
transform_null_equals | off
unix_socket_directory | /var/run/postgresql
unix_socket_group |
unix_socket_permissions | 511
update_process_title | on
vacuum_cost_delay | 0
vacuum_cost_limit | 200
vacuum_cost_page_dirty | 20
vacuum_cost_page_hit | 1
vacuum_cost_page_miss | 10
vacuum_freeze_min_age | 100000000
wal_buffers | 8
wal_sync_method | open_sync
wal_writer_delay | 200
work_mem | 81920
xmlbinary | base64
xmloption | content
zero_damaged_pages | off
(189 rows)

--
Berge Schwebs Bjørlo
Alegría!

Responses

Browse pgsql-general by date

  From Date Subject
Next Message David W Noon 2009-09-09 11:42:31 Re: where clauses and multiple tables
Previous Message Korry 2009-09-09 09:35:18 Re: Error log fillingup