From: "mark" <dvlhntr(at)gmail(dot)com>
To: <pgsql-performance(at)postgresql(dot)org>
Subject: benchmark woes and XFS options
Date: 2011-08-09 02:06:28
Message-ID: 012901cc5638$f44df9d0$dce9ed70$@com
Lists: pgsql-performance
Hello PG perf junkies,
Sorry, this may get a little long winded, and apologies if the formatting gets
trashed. Also apologies if this double-posts (I originally sent it yesterday
from the wrong account and that message is stalled, so my bad there); if a
moderator sees it still in the wait queue, feel free to remove it.
Short version:
My zcav and dd tests look to get ->CPU bound<-. Yes, CPU bound, with junk
numbers. My zcav numbers are flat like an SSD's, which is odd for 15K RPM
disks. I am not sure what the point of going further would be given these
unexpectedly poor numbers. I knew 12 disks weren't going to impress me (I am
used to 24), but I was expecting about 40-50% better than what I am getting.
Background:
I have been setting up some new servers for PG and I am getting some odd
numbers with zcav; I am hoping a second set of eyes here can point me in the
right direction. (Other tests like bonnie++ (1.03e) and dd also give me odd -
flat and low - numbers.)
I will preface this with: yes, I bought Greg's book. Yes, I read it, and it
has helped me in the past, but I seem to have hit an oddity.
(hardware,os, and config stuff listed at the end)
Long version:
In the past when dealing with storage I typically see a large gain moving
from ext3 to XFS, provided I set readahead to 16384 on either filesystem.
I also see the typical downward trend in MB/s (expected) and upward trend in
access times (expected) with either filesystem.
These blades + storage blades are giving me atypical results.
I am not seeing a dramatic downturn in MB/s in zcav, nor am I seeing access
time really increase (something I have only seen before when I forget to set
readahead high enough). Things are just flat at about 420MB/s in zcav @ 0.6ms
access time with XFS, and ~470MB/s @ 0.56ms for ext3.
FWIW, I sometimes get worthless results from zcav and bonnie++ using 1.03 or
1.96, which isn't something that has happened to me before, even though Greg
does mention it.
Also, when running zcav I will see kswapdX (0 and 1 in my two-socket case)
start to eat significant CPU time (~40-50% each); with dd, kswapd and
pdflush become very active as well. This only happens once free memory gets
low. zcav or dd also looks to get CPU bound at 100% while I/O wait stays
almost at 0.0 most of the time (iostat -x -d shows util% at 98%, though). I
see this with either XFS or ext3. Also, when I cat /proc/zoneinfo it looks
like I am getting heavy contention for a single page in DMA while the tests
are running. (See end of email for zoneinfo.)
Bonnie reports 99% CPU usage; watching it while running, it bounces between
100 and 99. Kswap goes nuts here as well.
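To rule the page cache out, a direct-I/O version of the dd read can be used; a minimal sketch (not something from the tests above), assuming the device name on this box:

```shell
#!/bin/sh
# Sketch: sequential read with O_DIRECT, bypassing the page cache entirely,
# so kswapd/pdflush can't distort the throughput numbers.
DEV=/dev/cciss/c0d0   # device name from this box; adjust as needed

if [ -b "$DEV" ]; then
    # 4 GiB sequential read; with iflag=direct the buffer size must be
    # a multiple of the device sector size (1M is fine here)
    dd if="$DEV" of=/dev/null bs=1M count=4096 iflag=direct
else
    echo "device $DEV not present"
fi
```

If throughput with O_DIRECT looks sane while the cached runs stay CPU bound, that points at the VM/page-cache side rather than the controller.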
I am led to believe that I may need a 2.6.32 (RHEL 6.1) or higher kernel to
see some of the kswapd issues go away (testing that hopefully later this
week). Maybe that will take care of everything; I don't know yet.
Side note: setting vm.swappiness to 10 (or 0) doesn't help, although others
on the RHEL support site indicated it fixed kswapd issues for them.
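For reference, what was tried, as a sysctl.conf fragment (a sketch; again, it didn't help here):

```
# /etc/sysctl.conf - persist the swappiness setting tried above
# (apply immediately with: sysctl -w vm.swappiness=10)
vm.swappiness = 10
```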
Running zcav on my home system (4-disk RAID 1+0, 3ware controller + BBWC,
ext4, Ubuntu 2.6.38-8) I don't see zcav near 100%, I see lots of I/O wait as
expected, and my zoneinfo for DMA doesn't sit at 1.
I'm not going to focus too much on ext3, since I am pretty sure I should be
able to get better numbers from XFS.
With mkfs.xfs I have done some reading, and it appears that it can't
automatically read the stripsize (aka stripe size to anyone other than HP)
or the number of disks. So I have been using the following:
mkfs.xfs -b size=4k -d su=256k,sw=6,agcount=256
(256K is the default HP stripsize for RAID 1+0; I have 12 disks in RAID 10,
so I used sw=6; agcount of 256 is a (random) number I got from Google that
seemed in the ballpark.)
which gives me:
meta-data=/dev/cciss/c0d0 isize=256 agcount=256, agsize=839936 blks
         =                sectsz=512 attr=2
data     =                bsize=4096 blocks=215012774, imaxpct=25
         =                sunit=64 swidth=384 blks
naming   =version 2       bsize=4096 ascii-ci=0
log      =internal log    bsize=4096 blocks=32768, version=2
         =                sectsz=512 sunit=64 blks, lazy-count=1
realtime =none            extsz=4096 blocks=0, rtextents=0
(If I don't specify the agcount or su/sw stuff, I get:
meta-data=/dev/cciss/c0d0 isize=256 agcount=4, agsize=53753194 blks
         =                sectsz=512 attr=2
data     =                bsize=4096 blocks=215012774, imaxpct=25
         =                sunit=0 swidth=0 blks
naming   =version 2       bsize=4096 ascii-ci=0
log      =internal log    bsize=4096 blocks=32768, version=2
         =                sectsz=512 sunit=0 blks, lazy-count=1
realtime =none            extsz=4096 blocks=0, rtextents=0)
So it seems like I should be giving it the extra parameters at mkfs.xfs
time... could someone confirm? In the past I have never specified the su, sw,
or ag groups; I have taken the defaults. But since I am getting odd numbers
here I started playing with them, getting little or no change.
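As a sanity check on the geometry math (assuming the 256 KB strip and 12-disk RAID 1+0 described above), the numbers work out like this:

```shell
#!/bin/sh
# Sanity check of the XFS geometry for 12 disks in RAID 1+0 with a
# 256 KB strip: in RAID 1+0 only half the spindles carry unique data.
STRIP_KB=256
DISKS=12
SW=$((DISKS / 2))                   # stripe width in data members -> 6
SU_FSB=$((STRIP_KB * 1024 / 4096))  # sunit in 4 KiB filesystem blocks -> 64
SWIDTH_FSB=$((SU_FSB * SW))         # swidth in filesystem blocks -> 384
echo "mkfs.xfs -b size=4k -d su=${STRIP_KB}k,sw=${SW}"
echo "xfs_info should show: sunit=${SU_FSB} swidth=${SWIDTH_FSB} blks"
```

which matches the sunit=64 swidth=384 that xfs_info reports above, so the su/sw values at least appear internally consistent.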
for mounting:
logbufs=8,noatime,nodiratime,nobarrier,inode64,allocsize=16m
(I know that noatime also implies nodiratime according to xfs.org, but in the
past I seem to get better numbers when specifying both.)
I am using nobarrier because I have a battery backed raid cache and the FAQ
@ XFS.org seems to indicate that is the right choice.
FWIW, if I put sunit and swidth in the mount options it seems to lower them
(when viewed with xfs_info), so I haven't been putting them in the mount
options.
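For reference, the same options as an /etc/fstab line (a sketch; device and mount point are the ones listed in the hardware section below):

```
# /etc/fstab - XFS mount options from above (sketch)
/dev/cciss/c0d0  /raid  xfs  logbufs=8,noatime,nodiratime,nobarrier,inode64,allocsize=16m  0 0
```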
verify readahead:
blockdev --getra /dev/cciss/c0d0
16384
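Since blockdev --setra doesn't persist across reboots, it gets re-applied at boot; a minimal sketch (e.g. from /etc/rc.local on RHEL 5), with the device name from this box:

```shell
#!/bin/sh
# Sketch: re-apply block-device readahead at boot (blockdev --setra does
# not survive a reboot). 16384 sectors * 512 bytes = 8 MiB of readahead.
DEV=/dev/cciss/c0d0
RA=16384

if [ -b "$DEV" ]; then
    blockdev --setra "$RA" "$DEV"
fi
echo "readahead: $RA sectors ($((RA * 512 / 1024 / 1024)) MiB)"
```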
If anyone wants the benchmark outputs I can send them, but basically zcav
being FLAT for both MB/s and access time tells me something is wrong. And
it will take days for me to re-run all the tests I have done; I didn't save
much once I saw results that didn't fit with what I thought I should get.
I haven't done much with pgbench yet, as I figure it's pointless to move on
while the raw I/O numbers look off to me. At that point I will make the call
between putting the WAL on the OS RAID 1, or going to 10 data disks, 2 OS,
and 2 WAL.
I have gone up to 2.6.18-27(something, wanna say 2 or 4) to see if the issue
went away; it didn't. I have gone back to 2.6.18-238.5 and put in a new
CCISS driver directly from HP, and the issue also does not go away. People
at work are thinking it might be a kernel bug that we have somehow never
noticed before, which is why we are going to look at RHEL 6.1. We tried a 5.3
kernel that someone on the RH bugzilla said didn't have the issue, but this
blade had a fit with it - no network, lots of other stuff not working - and
then it kernel panic'd, so we quickly gave up on that...
We may try to shoehorn in the 6.1 kernel and a few dependencies as well.
Moving to RHEL 6.1 will mean a long test period before it can go into prod,
and we want to get this new hardware in sooner than that can be done. (Even
with all its problems it's probably still faster than what it is replacing,
just from the 48GB of RAM and three-generations-newer CPUs.)
Hardware and config stuff as it sits right now.
Blade Hardware:
ProLiant BL460c G7 (bios power flag set to high performance)
2 intel 5660 cpus. (HT left on)
48GB of ram (12x4GB @ 1333MHz)
Smart Array P410i (Embedded)
Points of interest from hpacucli -
- Hardware Revision: Rev C
- Firmware Version: 3.66
- Cache Board Present: True
- Elevator Sort: Enabled
- Cache Status: OK
- Cache Backup Power Source: Capacitors
- Battery/Capacitor Count: 1
- Battery/Capacitor Status: OK
- Total Cache Size: 512 MB
- Accelerator Ratio: 25% Read / 75% Write
- Strip Size: 256 KB
- 2x 15K RPM 146GB 6Gbps SAS in raid 1 for OS (ext3)
- Array Accelerator: Enabled
- Status: OK
- drives firmware = HPD5
Blade Storage subsystem:
HP SB2200 (12 disk 15K )
Points of interest from hpacucli
Smart Array P410i in Slot 3
Controller Status: OK
Hardware Revision: Rev C
Firmware Version: 3.66
Elevator Sort: Enabled
Wait for Cache Room: Disabled
Cache Board Present: True
Cache Status: OK
Accelerator Ratio: 25% Read / 75% Write
Drive Write Cache: Disabled
Total Cache Size: 1024 MB
No-Battery Write Cache: Disabled
Cache Backup Power Source: Capacitors
Battery/Capacitor Count: 1
Battery/Capacitor Status: OK
SATA NCQ Supported: True
Logical Drive: 1
Size: 820.2 GB
Fault Tolerance: RAID 1+0
Heads: 255
Sectors Per Track: 32
Cylinders: 65535
Strip Size: 256 KB
Status: OK
Array Accelerator: Enabled
Disk Name: /dev/cciss/c0d0
Mount Points: /raid 820.2 GB
OS Status: LOCKED
12 drives in Raid 1+0, using XFS.
OS:
OS: RHEL 5.6 (2.6.18-238.9.1.el5)
Database use: PG 9.0.2 for OLTP.
CCISS info:
filename:
/lib/modules/2.6.18-238.9.1.el5/kernel/drivers/block/cciss.ko
version: 3.6.22-RH1
description: Driver for HP Controller SA5xxx SA6xxx version 3.6.22-RH1
author: Hewlett-Packard Company
XFS INFO:
xfsdump-2.2.48-3.el5
xfsprogs-2.10.2-7.el5
XFS mkfs string:
mkfs.xfs -b size=4k -d su=256k,sw=6,agcount=256
mkfs.xfs output:
meta-data=/dev/cciss/c0d0 isize=256 agcount=256, agsize=839936 blks
= sectsz=512 attr=2
data = bsize=4096 blocks=215012774, imaxpct=25
= sunit=64 swidth=384 blks
naming =version 2 bsize=4096 ascii-ci=0
log =internal log bsize=4096 blocks=32768, version=2
= sectsz=512 sunit=64 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0
Head of zoneinfo while zcav is running and kswapd is going nuts (the
min/low/high of 1 seems odd to me; on other systems these get above 1):
Node 0, zone DMA
pages free 2493
min 1
low 1
high 1
active 0
inactive 0
scanned 0 (a: 3 i: 3)
spanned 4096
present 2393
nr_anon_pages 0
nr_mapped 1
nr_file_pages 0
nr_slab 0
nr_page_table_pages 0
nr_dirty 0
nr_writeback 0
nr_unstable 0
nr_bounce 0
numa_hit 0
numa_miss 0
numa_foreign 0
numa_interleave 0
numa_local 0
numa_other 0
protection: (0, 3822, 24211, 24211)
pagesets
all_unreclaimable: 1
prev_priority: 12
start_pfn: 0
numastat (probably worthless since I have been pounding on this box for a
while before capturing it)
node0 node1
numa_hit 3126413031 247696913
numa_miss 95489353 2781917287
numa_foreign 2781917287 95489353
interleave_hit 81178 97872
local_node 3126297257 247706110
other_node 95605127 2781908090
Next message: Greg Smith, 2011-08-09 03:42:19, Re: benchmark woes and XFS options
Previous message: Kevin Grittner, 2011-08-08 18:35:11, Re: PostgreSQL 9.0.1 on Windows performance tunning help please