Re: Performances issues with SSD volume ?

From: Thomas SIMON <tsimon(at)neteven(dot)com>
To: Glyn Astill <glynastill(at)yahoo(dot)co(dot)uk>
Cc: "pgsql-admin(at)postgresql(dot)org" <pgsql-admin(at)postgresql(dot)org>
Subject: Re: Performances issues with SSD volume ?
Date: 2015-05-21 13:52:43
Message-ID: 555DE32B.4000706@neteven.com
Lists: pgsql-admin

On 21/05/2015 14:57, Glyn Astill wrote:
>> From: Thomas SIMON <tsimon(at)neteven(dot)com>
>> To: Glyn Astill <glynastill(at)yahoo(dot)co(dot)uk>
>> Cc: "pgsql-admin(at)postgresql(dot)org" <pgsql-admin(at)postgresql(dot)org>
>> Sent: Thursday, 21 May 2015, 13:12
>> Subject: Re: [ADMIN] Performances issues with SSD volume ?
>>
>> On 20/05/2015 18:50, Glyn Astill wrote:
>>>> From: Thomas SIMON <tsimon(at)neteven(dot)com>
>>>> To: glynastill(at)yahoo(dot)co(dot)uk
>>>> Cc: "pgsql-admin(at)postgresql(dot)org"
>> <pgsql-admin(at)postgresql(dot)org>
>>>> Sent: Wednesday, 20 May 2015, 16:41
>>>> Subject: Re: [ADMIN] Performances issues with SSD volume ?
>>>>
>>>> Hi Glyn,
>>>>
>>>> I'll try to answer these points.
>>>>
>>>> I've run some benchmarks, and indeed 3.2 is not helping. Not helping at all.
>>>> I changed to 3.14 and the gap is quite big!
>>>> With the pgbench RW test: 3.2 --> 4200 TPS; 3.14 --> 6900 TPS under the
>>>> same conditions.
>>>> With the pgbench RO test: 3.2 --> 37000 TPS; 3.14 --> 95000 TPS, same
>>>> conditions again.
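>>>>
>>>> For reference, the runs were along these lines (the scale factor and
>>>> client counts here are indicative placeholders, not necessarily the
>>>> exact values I used):
>>>>
>>>>     pgbench -i -s 1000 bench                # initialise the test database
>>>>     pgbench -c 64 -j 8 -T 600 bench         # RW test (default TPC-B-like mix)
>>>>     pgbench -c 64 -j 8 -T 600 -S bench      # RO test (select-only)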
>>> That's a start then.
>>>
>>>> So it should be better now; but when the server was in production, even
>>>> with the bad kernel, performance was already quite good before it
>>>> quickly degraded.
>>>> So I also think I have another configuration problem.
>>>>
>>>> You say you're IO bound, so some output from sar / iostat / dstat and
>>>> pg_stat_activity etc. before and during the issue would be of use.
>>>>
>>>> -> My server is not in production right now, so it is difficult to
>>>> replay the production load and gather useful metrics.
>>>> The best way I've found is to replay traffic from the logs with pgreplay.
>>>> I hoped the server would fall over again while replaying this traffic,
>>>> but it never happens... another thing I can't understand.
>>>>
>>>> Below is my dstat output when I replay this traffic (i.e. while the
>>>> server runs normally).
>>>> Unfortunately I have no outputs from when the server's performance
>>>> degraded.
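>>>>
>>>> (For reference, the dstat output was captured with something like:
>>>>
>>>>     dstat -tcdngy 5
>>>>
>>>> though the exact flags may have differed.)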
>>> It's a shame we can't get any insight into activity on the server
>>> during the issues.
>>>> Other things you asked:
>>>>
>>>> System memory size: 256 GB
>>>> SSD model numbers and how many: 4 SSD disks; RAID 10; model
>>>> INTEL SSDSC2BB480G4
>>>> RAID controller: MegaRAID SAS 2208
>>>> Partition alignments and stripe sizes: see fdisk output below
>>>> Kernel options: the config file is here:
>>>> ftp://ftp.ovh.net/made-in-ovh/bzImage/3.14.43/config-3.14.43-xxxx-std-ipv6-64
>>>> Filesystem used and mount options: ext4, see mtab below
>>>> IO scheduler: noop [deadline] cfq for my SSD RAID volume (see the check
>>>> sketched after the config below)
>>>> PostgreSQL version and configuration: 9.3.5
>>>>
>>>> max_connections=1800
>>>> shared_buffers=8GB
>>>> temp_buffers=32MB
>>>> work_mem=100MB
>>>> maintenance_work_mem=12GB
>>>> bgwriter_lru_maxpages=200
>>>> effective_io_concurrency=4
>>>> wal_level=hot_standby
>>>> wal_sync_method=fdatasync
>>>> wal_writer_delay=2000ms
>>>> commit_delay=1000
>>>> checkpoint_segments=80
>>>> checkpoint_timeout=15min
>>>> checkpoint_completion_target=0.7
>>>> archive_command='rsync ....'
>>>> max_wal_senders=10
>>>> wal_keep_segments=38600
>>>> vacuum_defer_cleanup_age=100
>>>> hot_standby = on
>>>> max_standby_archive_delay = 5min
>>>> max_standby_streaming_delay = 5min
>>>> hot_standby_feedback = on
>>>> random_page_cost = 1.0
>>>> effective_cache_size = 240GB
>>>> log_min_error_statement = warning
>>>> log_min_duration_statement = 0
>>>> log_checkpoints = on
>>>> log_connections = on
>>>> log_disconnections = on
>>>> log_line_prefix = '%m|%u|%d|%c|'
>>>> log_lock_waits = on
>>>> log_statement = 'all'
>>>> log_timezone = 'localtime'
>>>> track_activities = on
>>>> track_functions = pl
>>>> track_activity_query_size = 8192
>>>> autovacuum_max_workers = 5
>>>> autovacuum_naptime = 30s
>>>> autovacuum_vacuum_threshold = 40
>>>> autovacuum_analyze_threshold = 20
>>>> autovacuum_vacuum_scale_factor = 0.10
>>>> autovacuum_analyze_scale_factor = 0.10
>>>> autovacuum_vacuum_cost_delay = 5ms
>>>> default_transaction_isolation = 'read committed'
>>>> max_locks_per_transaction = 128
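>>>>
>>>> (The scheduler mentioned above was checked and set per device along
>>>> these lines; "sdX" is a placeholder for the actual RAID volume device:
>>>>
>>>>     cat /sys/block/sdX/queue/scheduler
>>>>     noop [deadline] cfq
>>>>     echo deadline > /sys/block/sdX/queue/scheduler
>>>> )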
>>>>
>>>>
>>>>
>>>> Connection pool sizing (pgpool2)
>>>> num_init_children = 1790
>>>> max_pool = 1
>>> 1800 is quite a lot of connections, and with max_pool=1 in pgpool
>>> you're effectively just using pgpool as a proxy (as I recall; my memory
>>> is a little fuzzy on pgpool now). Unless your app is stateful in some way
>>> or has unique users for each of those 1800 connections, you should lower
>>> the number of active connections. A general starting point is usually
>>> cpu cores * 2, so you could up max_pool and divide num_init_children by
>>> the same amount, e.g. as sketched below.
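>>>
>>> Something like this (illustrative numbers only; dividing and multiplying
>>> by 10 keeps the same backend connection total):
>>>
>>>     num_init_children = 179    # concurrent client slots, down from 1790
>>>     max_pool = 10              # cached backend pools per child, up from 1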
>>> Hard to say what you need to do without knowing what exactly you're
>>> doing though. What's the nature of the app(s)?
>> Yes, we just use it as a proxy for now.
>> We have approximately 100 different active users, each opening a varying
>> number of connections (twisted + zope apps).
>> The result is ~900 idle connections for ~60 active connections, but
>> sometimes (when stopping/starting prod) we need almost double the
>> connections, because some twisted services don't close their connections
>> immediately.
>> But this is the current (working) configuration, and I don't think my
>> disk performance problem is related to it.
> I think at this point you could do with going back and trying to reproduce the issue, then trace back up to pg_stat_activity to see what activity could be causing the disk i/o. I assume you've tried to reproduce the disk issues with a simple disk benchmark like bonnie++?
Yes, I think the same thing. I will probably do this tomorrow early
in the morning.
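
For the pg_stat_activity side, I plan to sample something like this while
the problem is happening (a rough query using the 9.3 columns):

    SELECT pid, state, waiting, now() - query_start AS runtime, query
    FROM pg_stat_activity
    WHERE state <> 'idle'
    ORDER BY runtime DESC;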
I tried to reproduce the disk issues with different stress tests like
bonnie++, fio and tsung, and I used a more realistic scenario with pgreplay
to replay my production traffic from the postgresql logfile.
However, I'm not sure how to diagnose the performance issues.
I mean, if I see the SSDs are at 100% utilisation, how can I figure out why
their behaviour changes?
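
(To watch the disks themselves I'll run something like:

    iostat -dxm 5

keeping an eye on %util, await and avgqu-sz, to see whether latency climbs
before utilisation saturates.)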

I'm asking myself another question, about the master/slave configuration.
For my test, I will set up the SSD server as a slave of the HDD server,
and after that promote it to master.
If I still have performance issues and must roll back, am I necessarily
forced to completely rebuild the new slave (the HDD server) with
pg_basebackup (and wait some hours while the files are transferred), or can
I directly demote the old master to a slave without spending time on a
rebuild (since the files should be the same on both servers)?
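
For reference, the standby setup I have in mind for the SSD server is the
usual 9.3 recovery.conf, roughly like this (hostnames and paths here are
placeholders):

    standby_mode = 'on'
    primary_conninfo = 'host=hdd-server port=5432 user=replication'
    recovery_target_timeline = 'latest'
    trigger_file = '/tmp/postgresql.trigger'    # touch this file to promote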

Thanks
Thomas
