Quick Links

Re: Performances issues with SSD volume ?

From:	Thomas SIMON <tsimon(at)neteven(dot)com>
To:	Glyn Astill <glynastill(at)yahoo(dot)co(dot)uk>
Cc:	"pgsql-admin(at)postgresql(dot)org" <pgsql-admin(at)postgresql(dot)org>
Subject:	Re: Performances issues with SSD volume ?
Date:	2015-05-21 12:12:22
Message-ID:	555DCBA6.1050607@neteven.com
Views:	Whole Thread \| Raw Message \| Download mbox \| Resend email
Thread:
Lists:	pgsql-admin

Le 20/05/2015 18:50, Glyn Astill a écrit :
>
>> From: Thomas SIMON <tsimon(at)neteven(dot)com>
>> To: glynastill(at)yahoo(dot)co(dot)uk
>> Cc: "pgsql-admin(at)postgresql(dot)org" <pgsql-admin(at)postgresql(dot)org>
>> Sent: Wednesday, 20 May 2015, 16:41
>> Subject: Re: [ADMIN] Performances issues with SSD volume ?
>>
>> Hi Glyn,
>>
>> I'll try to answer this points.
>>
>> I've made some benchs, and indeed 3.2 not helping. not helping at all.
>> I changed to 3.14 and gap is quite big !
>> With pgbench RW test, 3.2 --> 4200 TPS ; 3.14 --> 6900 TPS in same
>> conditions
>> With pgbench RO test, 3.2 --> 37000 TPS ; 3.14 --> 95000 TPS, same
>> conditions too.
>
> That's a start then.
>
>> It should so be better, but when server was in production, and ever with
>> bad kernel, performances was already quite good before they quickly
>> decreased.
>> So i think too I have another configuration problem.
>>
>> You say you're IO bound, so some output from sar / iostat / dstat and
>> pg_stat_activity etc before and during the issue would be of use.
>>
>> -> My server is not in production right now, so it is difficult to
>> replay production load and have some useful metrics.
>> The best way I've found is to replay trafic from logs with pgreplay.
>> I hoped that the server falls back by replaying this traffic, but it
>> never happens ... Another thing I can't understand ...
>>
>> Below is my dstat output when I replay this traffic (and so when server
>> runs normally)
>> I have unfortunately no more outputs when server's performances decreased.
> It's a shame we can't get any insight into activity on the server during the issues.
>>
>> Other things you asked
>>
>> System memory size : 256 Go
>> SSD Model numbers and how many : 4 SSd disks ; RAID 10 ; model
>> INTEL SSDSC2BB480G4
>> Raid controller : MegaRAID SAS 2208
>> Partition alignments and stripe sizes : see fdisk delow
>> Kernel options : the config file is here :
>> ftp://ftp.ovh.net/made-in-ovh/bzImage/3.14.43/config-3.14.43-xxxx-std-ipv6-64
>> Filesystem used and mount options : ext4, see mtab below
>> IO Scheduler : noop [deadline] cfq for my ssd raid volume
>> Postgresql version and configuration : 9.3.5
>>
>> max_connections=1800
>> shared_buffers=8GB
>> temp_buffers=32MB
>> work_mem=100MB
>> maintenance_work_mem=12GB
>> bgwriter_lru_maxpages=200
>> effective_io_concurrency=4
>> wal_level=hot_standby
>> wal_sync_method=fdatasync
>> wal_writer_delay=2000ms
>> commit_delay=1000
>> checkpoint_segments=80
>> checkpoint_timeout=15min
>> checkpoint_completion_target=0.7
>> archive_command='rsync ....'
>> max_wal_senders=10
>> wal_keep_segments=38600
>> vacuum_defer_cleanup_age=100
>> hot_standby = on
>> max_standby_archive_delay = 5min
>> max_standby_streaming_delay = 5min
>> hot_standby_feedback = on
>> random_page_cost = 1.0
>> effective_cache_size = 240GB
>> log_min_error_statement = warning
>> log_min_duration_statement = 0
>> log_checkpoints = on
>> log_connections = on
>> log_disconnections = on
>> log_line_prefix = '%m|%u|%d|%c|'
>> log_lock_waits = on
>> log_statement = 'all'
>> log_timezone = 'localtime'
>> track_activities = on
>> track_functions = pl
>> track_activity_query_size = 8192
>> autovacuum_max_workers = 5
>> autovacuum_naptime = 30s
>> autovacuum_vacuum_threshold = 40
>> autovacuum_analyze_threshold = 20
>> autovacuum_vacuum_scale_factor = 0.10
>> autovacuum_analyze_scale_factor = 0.10
>> autovacuum_vacuum_cost_delay = 5ms
>> default_transaction_isolation = 'read committed'
>> max_locks_per_transaction = 128
>>
>>
>>
>> Connection pool sizing (pgpool2)
>> num_init_children = 1790
>> max_pool = 1
>
> 1800 is quite a lot of connections, and with max_pool=1 in pgpool you're effectively just using pgpool as a proxy (as I recall, my memory is a little fuzzy on pgpool now). Unless your app is stateful in some way or has unique users for each of those 1800 connections you should lower the quantity of active connections. A general starting point is usually cpu cores * 2, so you could up max_pool and divide num_init_children by the same amount.
>
> Hard to say what you need to do without knowing what exactly you're doing though. What's the nature of the app(s)?
Yes, we just use it as a proxy for now.
We have approximately 100 different active users, doing for all of then
various number of connexions (twisted + zope apps)
result is ~ 900 idle connexions for ~ 60 active connexions, but
sometimes (when stopping/starting prod), we need almost double of
connexion because some twisted services don't stop their connexions
immediatly.
But this is the actual (working) configuration, and I don't think think
my performance disk is related to this.
>
>> I also add megacli parameters :
>>
>> Virtual Drive: 2 (Target Id: 2)
>> Name :datassd
>> RAID Level : Primary-1, Secondary-0, RAID Level Qualifier-0
>> Size : 893.25 GB
>> Sector Size : 512
>> Is VD emulated : Yes
>> Mirror Data : 893.25 GB
>> State : Optimal
>> Strip Size : 256 KB
>> Number Of Drives per span:2
>> Span Depth : 2
>> Default Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write
>> Cache if Bad BBU
>> Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write
>> Cache if Bad BBU
>> Default Access Policy: Read/Write
>> Current Access Policy: Read/Write
>> Disk Cache Policy : Enabled
>> Encryption Type : None
>> Bad Blocks Exist: No
>> PI type: No PI
>>
>> Is VD Cached: No
>
> Not using your raid controllers write cache then? Not sure just how important that is with SSDs these days, but if you've got a BBU set it to "WriteBack". Also change "Cache if Bad BBU" to "No Write Cache if Bad BBU" if you do that.
No, I had read some megacli related docs about SSD, and the advice was
to put writethrough on disks. (see
http://wiki.mikejung.biz/LSI#Configure_LSI_Card_for_SSD_RAID) last section.
Disks are already in "No Write Cache if Bad BBU" mode. (wrote on
splitted line on my extract)
>
>
>> Other outputs :
>> fdisk -l
>>
>> Disk /dev/sdc: 959.1 GB, 959119884288 bytes
>> 255 heads, 63 sectors/track, 116606 cylinders, total 1873281024 sectors
>> Units = sectors of 1 * 512 = 512 bytes
>> Sector size (logical/physical): 512 bytes / 4096 bytes
>> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
>> Disk identifier: 0x00000000
>>
>> Disk /dev/mapper/vg_datassd-lv_datassd: 751.6 GB, 751619276800 bytes
>> 255 heads, 63 sectors/track, 91379 cylinders, total 1468006400 sectors
>> Units = sectors of 1 * 512 = 512 bytes
>> Sector size (logical/physical): 512 bytes / 4096 bytes
>> I/O size (minimum/optimal): 4096 bytes / 4096 bytes
>> Disk identifier: 0x00000000
>>
>>
>> cat /etc/mtab
>> /dev/mapper/vg_datassd-lv_datassd /datassd ext4
>> rw,relatime,discard,nobarrier,data=ordered 0 0
>> (I added nobarrier option)
>>
>>
>> cat /sys/block/sdc/queue/scheduler
>> noop [deadline] cfq
>>
>
> You could swap relatime for noatime,nodiratime.
I'll swap to noatime, thanks.
>
>> sysctl kernel | grep sched
>> kernel.sched_child_runs_first = 0
>> kernel.sched_rr_timeslice_ms = 25
>> kernel.sched_rt_period_us = 1000000
>> kernel.sched_rt_runtime_us = 950000
>>
>> I've read some advices about changing kernel.sched_autogroup_enabled=0
>> and kernel.sched_migration_cost_ns=5000000, but this parameters are not
>> recognized by my kernel. So I don't know what to do with that...
> sched_migration_cost_ns would be called sched_migration_cost in your old 3.2 kernel, not sure why
> sched_autogroup_enabled wouldn't be recognized though.
I've found on config file that "CONFIG_SCHED_AUTOGROUP is not set" in
this kernel.
So i guess it is the same thing as if it was enabled=0 , right ?
I haven't found any parameter related to migration_sost in config file.

>
>> sysctl vm
>> vm.admin_reserve_kbytes = 8192
>> vm.block_dump = 0
>> vm.dirty_background_bytes = 8388608
>> vm.dirty_background_ratio = 0
>> vm.dirty_bytes = 67108864
>> vm.dirty_expire_centisecs = 3000
>> vm.dirty_ratio = 0
>> vm.dirty_writeback_centisecs = 500
>> vm.drop_caches = 3
>> vm.laptop_mode = 0
>> vm.legacy_va_layout = 0
>> vm.lowmem_reserve_ratio = 256 256 32
>> vm.max_map_count = 65530
>> vm.memory_failure_early_kill = 0
>> vm.memory_failure_recovery = 1
>> vm.min_free_kbytes = 65008
>> vm.min_slab_ratio = 5
>> vm.min_unmapped_ratio = 1
>> vm.mmap_min_addr = 4096
>> vm.nr_pdflush_threads = 0
>> vm.numa_zonelist_order = default
>> vm.oom_dump_tasks = 1
>> vm.oom_kill_allocating_task = 0
>> vm.overcommit_kbytes = 0
>> vm.overcommit_memory = 2
>> vm.overcommit_ratio = 50
>> vm.page-cluster = 3
>> vm.panic_on_oom = 0
>> vm.percpu_pagelist_fraction = 0
>> vm.scan_unevictable_pages = 0
>> vm.stat_interval = 1
>> vm.swappiness = 60
>> vm.user_reserve_kbytes = 131072
>> vm.vfs_cache_pressure = 100
>> vm.zone_reclaim_mode = 0
>>
>>
>> select * from pg_stat_activity
>> I've got hundred of entries for that when i'm in production, and I
>> can't
>> paste it here due to confidentiality.
>> By day, it is around 50 millions queries usually. (35% selects ; 55%
>> updates & 5% inserts)
>>
>>
>> lspci | grep -E 'RAID|SCSI|IDE|SATA'
>> 00:1f.2 SATA controller: Intel Corporation C600/X79 series chipset
>> 6-Port SATA AHCI Controller (rev 06)
>> 02:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2208
>> [Thunderbolt] (rev 05)
>> 07:00.0 Serial Attached SCSI controller: Intel Corporation C602 chipset
>> 4-Port SATA Storage Control Unit (rev 06)
>>
>> Thanks
>>
>> Thomas
>>
>>
>> Le 18/05/2015 16:29, Glyn Astill a écrit :
>>>> From: Koray Eyidoğan <korayey(at)gmail(dot)com>
>>>> To: Thomas SIMON <tsimon(at)neteven(dot)com>
>>>> Cc: pgsql-admin(at)postgresql(dot)org
>>>> Sent: Monday, 18 May 2015, 14:51
>>>> Subject: Re: [ADMIN] Performances issues with SSD volume ?
>>>>
>>>>
>>>>
>>>> Hi Thomas,
>>>>
>>>>
>>>> 3.2 kernel may be #1 cause of your I/O load problem:
>>>>
>>>>
>>>>
>> http://www.databasesoup.com/2014/09/why-you-need-to-avoid-linux-kernel-32.html
>>>>
>> https://medium.com/postgresql-talk/benchmarking-postgresql-with-different-linux-kernel-versions-on-ubuntu-lts-e61d57b70dd4
>>>>
>>>>
>>>> Have a nice day.
>>>>
>>>>
>>>> Koray
>>> Likely 3.2 kernel isn't helping, but I think we need much more
>> information before jumping to conclusions.
>>> You say you're IO bound, so some output from sar / iostat / dstat and
>> pg_stat_activity etc before and during the issue would be of use.
>>> Also:
>>> System memory size
>>>
>>> SSD Model numbers and how many
>>> Raid controller
>>> Partition allignments and stripe sizes
>>> Kernel options
>>> Filesystem used and mount options
>>> IO Scheduler
>>> Postgresql version and configuration
>>> Connection pool sizing
>>>
>>>
>>> Perhaps you could thow us the output of some of these:
>>>
>>> fdisk -l
>>> cat /etc/mtab
>>> cat /sys/block/<ssd device>/queue/scheduler
>>> sysctl kernel | grep sched
>>> sysctl vm
>>> select * from pg_stat_activity
>>> select name, setting from pg_settings
>>> lspci | grep -E 'RAID|SCSI|IDE|SATA'
>>
>>
>> --
>> Sent via pgsql-admin mailing list (pgsql-admin(at)postgresql(dot)org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-admin
>>

In response to

Re: Performances issues with SSD volume ? at 2015-05-20 16:50:29 from Glyn Astill

Responses

Re: Performances issues with SSD volume ? at 2015-05-21 12:57:12 from Glyn Astill

Browse pgsql-admin by date

	From	Date	Subject
Next Message	Thomas SIMON	2015-05-21 12:28:09	Re: Performances issues with SSD volume ?
Previous Message	Graeme B. Bell	2015-05-21 11:21:49	raid writethrough mode (WT), ssds and your DB. (was Performances issues with SSD volume ?)