Re: postgresql.conf recommendations

From: Johnny Tan <johnnydtan(at)gmail(dot)com>
To: Josh Krupka <jkrupka(at)gmail(dot)com>
Cc: Alex Kahn <alex(at)paperlesspost(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: postgresql.conf recommendations
Date: 2013-02-05 23:46:37
Message-ID: CABMVzL3mmx8KOBCrVQzff3FciXg_DJGULd4f3an4a417FUb1XQ@mail.gmail.com
Lists: pgsql-performance

# cat /sys/kernel/mm/redhat_transparent_hugepage/defrag
[always] never
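
So defrag is set to always here. If we end up needing to turn THP defrag off,
I'm assuming something along these lines would do it on this RHEL kernel (the
defrag path is the one shown above; the "enabled" knob next to it is my
assumption, and neither change survives a reboot without adding it to
rc.local or similar -- untested on our boxes):

# echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
# echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled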

On Tue, Feb 5, 2013 at 5:37 PM, Josh Krupka <jkrupka(at)gmail(dot)com> wrote:

> Just out of curiosity, are you using transparent huge pages?
> On Feb 5, 2013 5:03 PM, "Johnny Tan" <johnnydtan(at)gmail(dot)com> wrote:
>
>> Server specs:
>> Dell R610
>> dual E5645 hex-core 2.4GHz
>> 192GB RAM
>> RAID 1: 2x400GB SSD (OS + WAL logs)
>> RAID 10: 4x400GB SSD (/var/lib/pgsql)
>>
>>
>> /etc/sysctl.conf:
>> kernel.msgmnb = 65536
>> kernel.msgmax = 65536
>> kernel.shmmax = 68719476736
>> kernel.shmall = 4294967296
>> vm.overcommit_memory = 0
>> vm.swappiness = 0
>> vm.dirty_background_bytes = 536870912
>> vm.dirty_bytes = 536870912
>>
>>
>> postgresql.conf:
>> listen_addresses = '*' # what IP address(es) to listen on;
>> max_connections = 150 # (change requires restart)
>> shared_buffers = 48GB # min 128kB
>> work_mem = 1310MB # min 64kB
>> maintenance_work_mem = 24GB # min 1MB
>> wal_level = hot_standby # minimal, archive, or hot_standby
>> checkpoint_segments = 64 # in logfile segments, min 1, 16MB each
>> checkpoint_timeout = 30min # range 30s-1h
>> checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
>> max_wal_senders = 5 # max number of walsender processes
>> wal_keep_segments = 2000 # in logfile segments, 16MB each; 0 disables
>> hot_standby = on # "on" allows queries during recovery
>> max_standby_archive_delay = 120s # max delay before canceling queries
>> max_standby_streaming_delay = 120s # max delay before canceling queries
>> effective_cache_size = 120GB
>> constraint_exclusion = partition # on, off, or partition
>> log_destination = 'syslog' # Valid values are combinations of
>> logging_collector = on # Enable capturing of stderr and csvlog
>> log_directory = 'pg_log' # directory where log files are written,
>> log_filename = 'postgresql-%a.log' # log file name pattern,
>> log_truncate_on_rotation = on # If on, an existing log file with the
>> log_rotation_age = 1d # Automatic rotation of logfiles will
>> log_rotation_size = 0 # Automatic rotation of logfiles will
>> log_min_duration_statement = 500 # -1 is disabled, 0 logs all statements
>> log_checkpoints = on
>> log_line_prefix = 'user=%u db=%d remote=%r ' # special values:
>> log_lock_waits = on # log lock waits >= deadlock_timeout
>> autovacuum = on # Enable autovacuum subprocess? 'on'
>> log_autovacuum_min_duration = 0 # -1 disables, 0 logs all actions and
>> autovacuum_max_workers = 5 # max number of autovacuum subprocesses
>> datestyle = 'iso, mdy'
>> lc_messages = 'en_US.UTF-8' # locale for system error message
>> lc_monetary = 'en_US.UTF-8' # locale for monetary formatting
>> lc_numeric = 'en_US.UTF-8' # locale for number formatting
>> lc_time = 'en_US.UTF-8' # locale for time formatting
>> default_text_search_config = 'pg_catalog.english'
>> deadlock_timeout = 300ms
>>
>>
>> per pgtune:
>>
>> #------------------------------------------------------------------------------
>> # pgtune wizard run on 2013-02-05
>> # Based on 198333224 KB RAM in the server
>> #------------------------------------------------------------------------------
>> default_statistics_target = 100
>> maintenance_work_mem = 1GB
>> checkpoint_completion_target = 0.9
>> effective_cache_size = 128GB
>> work_mem = 1152MB
>> wal_buffers = 8MB
>> checkpoint_segments = 16
>> shared_buffers = 44GB
>> max_connections = 80
>>
>> We use pgbouncer (set to 140 connections) in transaction pooling mode in
>> front of our db.
>>
>>
>> The problem:
>>
>> For the most part, the server hums along. No applications other than
>> postgres run on this server. Load averages rarely break 2.0, it never
>> swaps, and %iowait is usually not more than 0.12.
>>
>> But periodically, there are spikes in our app's db response time.
>> Normally, the app's db response time hovers in the 100ms range for most of
>> the day. During the spike times, it can go up to 1000ms or 1500ms, and the
>> number of pg connections goes to 140 (maxed out at pgbouncer's limit, where
>> normally it's only about 20-40 connections). Also, during these times,
>> which usually last less than 2 minutes, we will see several thousand
>> queries in the pg log (this is with log_min_duration_statement = 500),
>> compared to maybe one or two dozen 500ms+ queries in non-spike times.
>>
>> The gap between spikes could be an hour, two hours, sometimes half a day. There
>> doesn't appear to be any pattern that we can see:
>> * there are no obvious queries that are locking the db
>> * it doesn't necessarily happen during high-traffic times, though it can
>> * it doesn't happen during any known regularly-scheduled system, db, or app
>> job, including crons
>> * in general, there's no discernible regularity to it at all
>> * it doesn't coincide with checkpoint starts or completions
>> * it doesn't coincide with autovacuums
>> * there are no messages in any system logs that might indicate any system
>> or hardware-related issue
>>
>> Besides spikes in our graphs, the only other visible effect is that
>> %system in sar goes from an average of 0.7 to as high as 10.0 or so (%iowait
>> and all other sar variables remain the same).
>>
>> According to our monitoring system, web requests get queued up during these
>> spikes, and our alerting system sometimes reports either a timeout or
>> multiple web response times greater than 300ms. So we suspect (but have no
>> proof) that some users see either a long hang or possibly a timeout. But
>> since a spike almost always lasts less than two minutes, and sometimes less
>> than one, we don't really hear any complaints (presumably most people hit
>> reload, things work again, and they continue on), and we haven't been able
>> to observe any negative effect ourselves.
>>
>> But we want to get in front of the problem, in case it is something that
>> will get worse as traffic continues to grow. We've tweaked various configs
>> on the OS side as well as the postgresql.conf side. What's posted above is
>> our current setup, and the problem persists.
>>
>> Any ideas as to where we could even look?
>>
>> Also, whether related or unrelated to the spikes, are there any
>> recommendations for our postgresql.conf or sysctl.conf based on our
>> hardware? From pgtune's output, I am lowering maintenance_work_mem from
>> 24GB down to maybe 2GB, but I keep reading conflicting things about other
>> settings, such as checkpoints or max_connections.
>>
>> johnny
>>
>>
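
One other thing we're planning to try when the next spike hits is sampling
pg_stat_activity every few seconds (from a small shell loop or cron), to see
what those 140 backends are actually doing. Assuming 9.2-style column names
(state/query -- on 9.1 it would be current_query instead), and with "mydb" as
a placeholder for our database, something roughly like:

$ psql -d mydb -c "SELECT now() - query_start AS runtime, waiting, state, query
      FROM pg_stat_activity WHERE state <> 'idle' ORDER BY query_start;"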
