Postgres slave not catching up (on 9.2)

From: Ruben Domingo Gaspar Aparicio <Ruben(dot)Gaspar(dot)Aparicio(at)cern(dot)ch>
To: "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Cc: "dbondemand-admin (DB On Demand administrators)" <dbondemand-admin(at)cern(dot)ch>
Subject: Postgres slave not catching up (on 9.2)
Date: 2014-11-08 13:11:25
Message-ID: 6FCBD86BDC789C43AC696969EC340228010939EB7E@CERNXCHG44.cern.ch
Views: Raw Message | Whole Thread | Download mbox | Resend email
Thread:
Lists: pgsql-performance

Hello,

I have built up a hot-standby between a master running 9.2.4 and a slave running 9.2.9. I did the initial copy using "pg_basebackup" . My recovery.conf looks like:

standby_mode = 'on'
primary_conninfo = 'host=pXXXXXXXX port=XXXX user=replicator'
trigger_file = 'failover.now'
restore_command = 'test -f /ORA/dbs02/PUPDBTST/archive/%f && ln -fns /ORA/dbs02/PUPDBTST/archive/%f %p'
archive_cleanup_command = '/usr/local/pgsql/postgresql-9.2.9/bin/pg_archivecleanup /ORA/dbs03/PUPDBTST/data %r'

The slave (I don't have control on the master) is using 2 NFS file systems, one for WALs and another one for the data, on Netapp controllers:

dbnasg401-12a:/vol/dodpupdbtst02 on /ORA/dbs02/PUPDBTST type nfs (rw,remount,hard,nointr,rsize=32768,wsize=32768,tcp,actimeo=0,timeo=600)
dbnasg403-12a:/vol/dodpupdbtst03 on /ORA/dbs03/PUPDBTST type nfs (rw,remount,hard,nointr,rsize=32768,wsize=32768,tcp,actimeo=0,timeo=600)

The master produces quite a lot of WALs. This is what I get on the slave (number of WAL files, date-hour, Total size in MB), so per day is more than 400GB:

1582 20140930-17 25312MB
1328 20140930-18 21248MB
1960 20140930-19 31360MB
1201 20140930-20 19216MB
1467 20140930-21 23472MB
1298 20140930-22 20768MB
1579 20140930-23 25264MB
1646 20141001-00 26336MB
1274 20141001-01 20384MB
1652 20141001-02 26432MB
1756 20141001-03 28096MB
1628 20141001-04 26048MB
1015 20141001-05 16240MB
1624 20141001-06 25984MB
1652 20141001-07 26432MB
1286 20141001-08 20576MB
1485 20141001-09 23760MB
1987 20141001-10 31792MB
1432 20141001-11 22912MB
1235 20141001-12 19760MB
1690 20141001-13 27040MB
1442 20141001-14 23072MB
1500 20141001-15 24000MB
1306 20141001-16 20896MB
1491 20141001-17 23856MB
1535 20141001-18 24560MB
1548 20141001-19 24768MB
1068 20141001-20 17088MB
1519 20141001-21 24304MB
1608 20141001-22 25728MB
1019 20141001-23 16304MB
1568 20141002-00 25088MB
1411 20141002-01 22576MB
1781 20141002-02 28496MB
1280 20141002-03 20480MB
1556 20141002-04 24896MB
1114 20141002-05 17824MB
1906 20141002-06 30496MB
1316 20141002-07 21056MB
1483 20141002-08 23728MB
1245 20141002-09 19920MB

The slave is running on a server with 48GB, and 8 cores (Intel(R) Xeon(R) CPU E5630 @ 2.53GHz) running red hat 5.10, herewith the postgresql.conf:
postgres=# show all;
name | setting | description
---------------------------------+------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------
allow_system_table_mods | off | Allows modifications of the structure of system tables.
application_name | psql | Sets the application name to be reported in statistics and logs.
archive_command | (disabled) | Sets the shell command that will be called to archive a WAL file.
archive_mode | off | Allows archiving of WAL files using archive_command.
archive_timeout | 0 | Forces a switch to the next xlog file if a new file has not been started within N seconds.
array_nulls | on | Enable input of NULL elements in arrays.
authentication_timeout | 1min | Sets the maximum allowed time to complete client authentication.
autovacuum | on | Starts the autovacuum subprocess.
autovacuum_analyze_scale_factor | 0.1 | Number of tuple inserts, updates, or deletes prior to analyze as a fraction of reltuples.
autovacuum_analyze_threshold | 50 | Minimum number of tuple inserts, updates, or deletes prior to analyze.
autovacuum_freeze_max_age | 200000000 | Age at which to autovacuum a table to prevent transaction ID wraparound.
autovacuum_max_workers | 3 | Sets the maximum number of simultaneously running autovacuum worker processes.
autovacuum_naptime | 20s | Time to sleep between autovacuum runs.
autovacuum_vacuum_cost_delay | 5ms | Vacuum cost delay in milliseconds, for autovacuum.
autovacuum_vacuum_cost_limit | 1000 | Vacuum cost amount available before napping, for autovacuum.
autovacuum_vacuum_scale_factor | 0.1 | Number of tuple updates or deletes prior to vacuum as a fraction of reltuples.
autovacuum_vacuum_threshold | 50 | Minimum number of tuple updates or deletes prior to vacuum.
backslash_quote | safe_encoding | Sets whether "\'" is allowed in string literals.
bgwriter_delay | 200ms | Background writer sleep time between rounds.
bgwriter_lru_maxpages | 100 | Background writer maximum number of LRU pages to flush per round.
bgwriter_lru_multiplier | 2 | Multiple of the average buffer usage to free per round.
block_size | 8192 | Shows the size of a disk block.
bonjour | off | Enables advertising the server via Bonjour.
bonjour_name | | Sets the Bonjour service name.
bytea_output | hex | Sets the output format for bytea.
check_function_bodies | on | Check function bodies during CREATE FUNCTION.
checkpoint_completion_target | 0.9 | Time spent flushing dirty buffers during checkpoint, as fraction of checkpoint interval.
checkpoint_segments | 64 | Sets the maximum distance in log segments between automatic WAL checkpoints.
checkpoint_timeout | 10min | Sets the maximum time between automatic WAL checkpoints.
checkpoint_warning | 1min | Enables warnings if checkpoint segments are filled more frequently than this.
client_encoding | UTF8 | Sets the client's character set encoding.
client_min_messages | notice | Sets the message levels that are sent to the client.
commit_delay | 0 | Sets the delay in microseconds between transaction commit and flushing WAL to disk.
commit_siblings | 5 | Sets the minimum concurrent open transactions before performing commit_delay.
config_file | /ORA/dbs03/PUPDBTST/data/postgresql.conf | Sets the server's main configuration file.
constraint_exclusion | on | Enables the planner to use constraints to optimize queries.
cpu_index_tuple_cost | 0.005 | Sets the planner's estimate of the cost of processing each index entry during an index scan.
cpu_operator_cost | 0.0025 | Sets the planner's estimate of the cost of processing each operator or function call.
cpu_tuple_cost | 0.01 | Sets the planner's estimate of the cost of processing each tuple (row).
cursor_tuple_fraction | 0.1 | Sets the planner's estimate of the fraction of a cursor's rows that will be retrieved.
data_directory | /ORA/dbs03/PUPDBTST/data | Sets the server's data directory.
DateStyle | ISO, MDY | Sets the display format for date and time values.
db_user_namespace | off | Enables per-database user names.
deadlock_timeout | 1s | Sets the time to wait on a lock before checking for deadlock.
debug_assertions | off | Turns on various assertion checks.
debug_pretty_print | on | Indents parse and plan tree displays.
debug_print_parse | off | Logs each query's parse tree.
debug_print_plan | off | Logs each query's execution plan.
debug_print_rewritten | off | Logs each query's rewritten parse tree.
default_statistics_target | 50 | Sets the default statistics target.
default_tablespace | | Sets the default tablespace to create tables and indexes in.
default_text_search_config | pg_catalog.english | Sets default text search configuration.
default_transaction_deferrable | off | Sets the default deferrable status of new transactions.
default_transaction_isolation | read committed | Sets the transaction isolation level of each new transaction.
default_transaction_read_only | off | Sets the default read-only status of new transactions.
default_with_oids | off | Create new tables with OIDs by default.
dynamic_library_path | $libdir | Sets the path for dynamically loadable modules.
effective_cache_size | 36GB | Sets the planner's assumption about the size of the disk cache.
effective_io_concurrency | 1 | Number of simultaneous requests that can be handled efficiently by the disk subsystem.
enable_bitmapscan | on | Enables the planner's use of bitmap-scan plans.
enable_hashagg | on | Enables the planner's use of hashed aggregation plans.
enable_hashjoin | on | Enables the planner's use of hash join plans.
enable_indexonlyscan | on | Enables the planner's use of index-only-scan plans.
enable_indexscan | on | Enables the planner's use of index-scan plans.
enable_material | on | Enables the planner's use of materialization.
enable_mergejoin | on | Enables the planner's use of merge join plans.
enable_nestloop | on | Enables the planner's use of nested-loop join plans.
enable_seqscan | on | Enables the planner's use of sequential-scan plans.
enable_sort | on | Enables the planner's use of explicit sort steps.
enable_tidscan | on | Enables the planner's use of TID scan plans.
escape_string_warning | on | Warn about backslash escapes in ordinary string literals.
event_source | PostgreSQL | Sets the application name used to identify PostgreSQL messages in the event log.
exit_on_error | off | Terminate session on any error.
external_pid_file | | Writes the postmaster PID to the specified file.
extra_float_digits | 0 | Sets the number of digits displayed for floating-point values.
from_collapse_limit | 8 | Sets the FROM-list size beyond which subqueries are not collapsed.
fsync | off | Forces synchronization of updates to disk.
full_page_writes | on | Writes full pages to WAL when first modified after a checkpoint.
geqo | on | Enables genetic query optimization.
geqo_effort | 5 | GEQO: effort is used to set the default for other GEQO parameters.
geqo_generations | 0 | GEQO: number of iterations of the algorithm.
geqo_pool_size | 0 | GEQO: number of individuals in the population.
geqo_seed | 0 | GEQO: seed for random path selection.
geqo_selection_bias | 2 | GEQO: selective pressure within the population.
geqo_threshold | 12 | Sets the threshold of FROM items beyond which GEQO is used.
gin_fuzzy_search_limit | 0 | Sets the maximum allowed result for exact search by GIN.
hba_file | /ORA/dbs03/PUPDBTST/data/pg_hba.conf | Sets the server's "hba" configuration file.
hot_standby | on | Allows connections and queries during recovery.
hot_standby_feedback | off | Allows feedback from a hot standby to the primary that will avoid query conflicts.
ident_file | /ORA/dbs03/PUPDBTST/data/pg_ident.conf | Sets the server's "ident" configuration file.
ignore_system_indexes | off | Disables reading from system indexes.
integer_datetimes | on | Datetimes are integer based.
IntervalStyle | postgres | Sets the display format for interval values.
join_collapse_limit | 8 | Sets the FROM-list size beyond which JOIN constructs are not flattened.
krb_caseins_users | off | Sets whether Kerberos and GSSAPI user names should be treated as case-insensitive.
krb_server_keyfile | FILE:/etc/postgresql/krb5.keytab | Sets the location of the Kerberos server key file.
krb_srvname | postgres | Sets the name of the Kerberos service.
lc_collate | en_US.UTF-8 | Shows the collation order locale.
lc_ctype | en_US.UTF-8 | Shows the character classification and case conversion locale.
lc_messages | en_US.UTF-8 | Sets the language in which messages are displayed.
lc_monetary | en_US.UTF-8 | Sets the locale for formatting monetary amounts.
lc_numeric | en_US.UTF-8 | Sets the locale for formatting numbers.
lc_time | en_US.UTF-8 | Sets the locale for formatting date and time values.
listen_addresses | * | Sets the host name or IP address(es) to listen to.
lo_compat_privileges | off | Enables backward compatibility mode for privilege checks on large objects.
local_preload_libraries | | Lists shared libraries to preload into each backend.
log_autovacuum_min_duration | 0 | Sets the minimum execution time above which autovacuum actions will be logged.
log_checkpoints | off | Logs each checkpoint.
log_connections | off | Logs each successful connection.
log_destination | stderr | Sets the destination for server log output.
log_directory | pg_log | Sets the destination directory for log files.
log_disconnections | off | Logs end of a session, including duration.
log_duration | off | Logs the duration of each completed SQL statement.
log_error_verbosity | default | Sets the verbosity of logged messages.
log_executor_stats | off | Writes executor performance statistics to the server log.
log_file_mode | 0600 | Sets the file permissions for log files.
log_filename | postgresql-%Y-%m-%d_%H%M%S.log | Sets the file name pattern for log files.
log_hostname | off | Logs the host name in the connection logs.
log_line_prefix | <%m> | Controls information prefixed to each log line.
log_lock_waits | off | Logs long lock waits.
log_min_duration_statement | -1 | Sets the minimum execution time above which statements will be logged.
log_min_error_statement | error | Causes all statements generating error at or above this level to be logged.
log_min_messages | warning | Sets the message levels that are logged.
log_parser_stats | off | Writes parser performance statistics to the server log.
log_planner_stats | off | Writes planner performance statistics to the server log.
log_rotation_age | 1d | Automatic log file rotation will occur after N minutes.
log_rotation_size | 10MB | Automatic log file rotation will occur after N kilobytes.
log_statement | none | Sets the type of statements logged.
log_statement_stats | off | Writes cumulative performance statistics to the server log.
log_temp_files | -1 | Log the use of temporary files larger than this number of kilobytes.
log_timezone | Europe/Paris | Sets the time zone to use in log messages.
log_truncate_on_rotation | off | Truncate existing log files of same name during log rotation.
logging_collector | off | Start a subprocess to capture stderr output and/or csvlogs into log files.
maintenance_work_mem | 1GB | Sets the maximum memory to be used for maintenance operations.
max_connections | 80 | Sets the maximum number of concurrent connections.
max_files_per_process | 1000 | Sets the maximum number of simultaneously open files for each server process.
max_function_args | 100 | Shows the maximum number of function arguments.
max_identifier_length | 63 | Shows the maximum identifier length.
max_index_keys | 32 | Shows the maximum number of index keys.
max_locks_per_transaction | 64 | Sets the maximum number of locks per transaction.
max_pred_locks_per_transaction | 64 | Sets the maximum number of predicate locks per transaction.
max_prepared_transactions | 0 | Sets the maximum number of simultaneously prepared transactions.
max_stack_depth | 2MB | Sets the maximum stack depth, in kilobytes.
max_standby_archive_delay | 30s | Sets the maximum delay before canceling queries when a hot standby server is processing archived WAL data.
max_standby_streaming_delay | 30s | Sets the maximum delay before canceling queries when a hot standby server is processing streamed WAL data.
max_wal_senders | 0 | Sets the maximum number of simultaneously running WAL sender processes.
password_encryption | on | Encrypt passwords.
port | 6600 | Sets the TCP port the server listens on.
post_auth_delay | 0 | Waits N seconds on connection startup after authentication.
pre_auth_delay | 0 | Waits N seconds on connection startup before authentication.
quote_all_identifiers | off | When generating SQL fragments, quote all identifiers.
random_page_cost | 4 | Sets the planner's estimate of the cost of a nonsequentially fetched disk page.
replication_timeout | 1min | Sets the maximum time to wait for WAL replication.
restart_after_crash | on | Reinitialize server after backend crash.
search_path | "$user",public | Sets the schema search order for names that are not schema-qualified.
segment_size | 1GB | Shows the number of pages per disk file.
seq_page_cost | 1 | Sets the planner's estimate of the cost of a sequentially fetched disk page.
server_encoding | UTF8 | Sets the server (database) character set encoding.
server_version | 9.2.9 | Shows the server version.
server_version_num | 90209 | Shows the server version as an integer.
session_replication_role | origin | Sets the session's behavior for triggers and rewrite rules.
shared_buffers | 12GB | Sets the number of shared memory buffers used by the server.
shared_preload_libraries | | Lists shared libraries to preload into server.
sql_inheritance | on | Causes subtables to be included by default in various commands.
ssl | off | Enables SSL connections.
ssl_ca_file | | Location of the SSL certificate authority file.
ssl_cert_file | server.crt | Location of the SSL server certificate file.
ssl_ciphers | ALL:!ADH:!LOW:!EXP:!MD5:@STRENGTH | Sets the list of allowed SSL ciphers.
ssl_crl_file | | Location of the SSL certificate revocation list file.
ssl_key_file | server.key | Location of the SSL server private key file.
ssl_renegotiation_limit | 512MB | Set the amount of traffic to send and receive before renegotiating the encryption keys.
standard_conforming_strings | on | Causes '...' strings to treat backslashes literally.
statement_timeout | 0 | Sets the maximum allowed duration of any statement.
stats_temp_directory | pg_stat_tmp | Writes temporary statistics files to the specified directory.
superuser_reserved_connections | 3 | Sets the number of connection slots reserved for superusers.
synchronize_seqscans | on | Enable synchronized sequential scans.
synchronous_commit | off | Sets the current transaction's synchronization level.
synchronous_standby_names | | List of names of potential synchronous standbys.
syslog_facility | local0 | Sets the syslog "facility" to be used when syslog enabled.
syslog_ident | postgres | Sets the program name used to identify PostgreSQL messages in syslog.
tcp_keepalives_count | 0 | Maximum number of TCP keepalive retransmits.
tcp_keepalives_idle | 0 | Time between issuing TCP keepalives.
tcp_keepalives_interval | 0 | Time between TCP keepalive retransmits.
temp_buffers | 8MB | Sets the maximum number of temporary buffers used by each session.
temp_file_limit | -1 | Limits the total size of all temporary files used by each session.
temp_tablespaces | | Sets the tablespace(s) to use for temporary tables and sort files.
TimeZone | Europe/Paris | Sets the time zone for displaying and interpreting time stamps.
timezone_abbreviations | Default | Selects a file of time zone abbreviations.
trace_notify | off | Generates debugging output for LISTEN and NOTIFY.
trace_recovery_messages | log | Enables logging of recovery-related debugging information.
trace_sort | off | Emit information about resource usage in sorting.
track_activities | on | Collects information about executing commands.
track_activity_query_size | 1024 | Sets the size reserved for pg_stat_activity.query, in bytes.
track_counts | on | Collects statistics on database activity.
track_functions | none | Collects function-level statistics on database activity.
track_io_timing | off | Collects timing statistics for database I/O activity.
transaction_deferrable | off | Whether to defer a read-only serializable transaction until it can be executed with no possible serialization failures.
transaction_isolation | read committed | Sets the current transaction's isolation level.
transaction_read_only | on | Sets the current transaction's read-only status.
transform_null_equals | off | Treats "expr=NULL" as "expr IS NULL".
unix_socket_directory | /var/lib/pgsql | Sets the directory where the Unix-domain socket will be created.
unix_socket_group | | Sets the owning group of the Unix-domain socket.
unix_socket_permissions | 0777 | Sets the access permissions of the Unix-domain socket.
update_process_title | on | Updates the process title to show the active SQL command.
vacuum_cost_delay | 0 | Vacuum cost delay in milliseconds.
vacuum_cost_limit | 200 | Vacuum cost amount available before napping.
vacuum_cost_page_dirty | 20 | Vacuum cost for a page dirtied by vacuum.
vacuum_cost_page_hit | 1 | Vacuum cost for a page found in the buffer cache.
vacuum_cost_page_miss | 10 | Vacuum cost for a page not found in the buffer cache.
vacuum_defer_cleanup_age | 0 | Number of transactions by which VACUUM and HOT cleanup should be deferred, if any.
vacuum_freeze_min_age | 50000000 | Minimum age at which VACUUM should freeze a table row.
vacuum_freeze_table_age | 150000000 | Age at which VACUUM should scan whole table to freeze tuples.
wal_block_size | 8192 | Shows the block size in the write ahead log.
wal_buffers | 8MB | Sets the number of disk-page buffers in shared memory for WAL.
wal_keep_segments | 0 | Sets the number of WAL files held for standby servers.
wal_level | hot_standby | Set the level of information written to the WAL.
wal_receiver_status_interval | 10s | Sets the maximum interval between WAL receiver status reports to the primary.
wal_segment_size | 16MB | Shows the number of pages per write ahead log segment.
wal_sync_method | fdatasync | Selects the method used for forcing WAL updates to disk.
wal_writer_delay | 200ms | WAL writer sleep time between WAL flushes.
work_mem | 288MB | Sets the maximum memory to be used for query workspaces.
xmlbinary | base64 | Sets how binary values are to be encoded in XML.
xmloption | content | Sets whether XML data in implicit parsing and serialization operations is to be considered as documents or content fragments.
zero_damaged_pages | off | Continues processing past damaged page headers.

The streaming is working perfectly:

/usr/local/pgsql/postgresql-9.2.9/bin/psql -U admin -h XXXXX -p XXXXX -d puppetdb -c "select pid,usesysid, usename, application_name, client_addr, state, sent_location,write_location,replay_location,sync_state from pg_stat_replication;"
pid | usesysid | usename | application_name | client_addr | state | sent_location | write_location | replay_location | sync_state
--------+----------+------------+------------------+-------------+-----------+---------------+----------------+-----------------+------------
117659 | 16384 | replicator | walreceiver | 10.16.7.137 | streaming | AA74/DD630978 | AA74/DD630978 | A977/F84F0BE0 | async
(1 row)

But the lag is increasing constantly, it looks the replay can not cope with:

/usr/local/pgsql/postgresql-9.2.9/bin/psql -U postgres -h /var/lib/pgsql/ -p 6600 -d puppetdb -c "SELECT now(), now() - pg_last_xact_replay_timestamp() AS time_lag" | perl -ne 'if (/\|\s+(\d{2}):(\d{2}):(\d{2})\.\d+/) {$hour=$1;$min=$2;$sec=$3; print $_;}'

2014-09-22 18:40:02.004482+02 | 00:00:03.166557
2014-09-22 18:50:02.001836+02 | 00:00:00.765323
2014-09-22 19:00:01.229943+02 | 00:00:00.600354
2014-09-22 19:10:01.655343+02 | 00:00:07.85969
2014-09-22 19:20:01.872653+02 | 00:07:23.727307
2014-09-22 19:30:01.458236+02 | 00:14:16.244349
2014-09-22 19:40:01.715044+02 | 00:20:31.616665
2014-09-22 19:50:01.949646+02 | 00:28:20.129136
2014-09-22 20:00:01.216571+02 | 00:31:11.289404
...
2014-09-30 14:00:01.67156+02 | 23:20:11.229815
2014-09-30 14:10:01.884168+02 | 23:23:59.162476
2014-09-30 14:20:02.022649+02 | 23:26:39.082807
2014-09-30 14:30:01.263667+02 | 23:34:02.517874
2014-09-30 14:40:01.421162+02 | 23:41:30.496393
2014-09-30 14:50:01.650239+02 | 23:49:11.102335
2014-09-30 15:00:02.055655+02 | 23:56:07.012862

I tried to play with how the IO is handled, making it less strict setting synchronous_commit and fsync to off with not much success.
I have also done a second test increasing shared_buffers from 12GB to 24GB (we are running on a 48GB, 8 cores server).

Please let me know if you can see something obvious I am missing.
Thanks for your help,
Ruben

Responses

Browse pgsql-performance by date

  From Date Subject
Next Message Robert Klemme 2014-11-08 20:35:47 Re: Postgres slave not catching up (on 9.2)
Previous Message David G Johnston 2014-11-07 16:50:26 Re: Postgres does not use indexes with OR-conditions