Re: RAID stripe size question

From: "Alex Turner" <armtuk(at)gmail(dot)com>
To: "Ron Peacetree" <rjpeace(at)earthlink(dot)net>
Cc: "Mikael Carneholm" <Mikael(dot)Carneholm(at)wirelesscar(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: RAID stripe size question
Date: 2006-07-18 04:21:51
Message-ID: 33c6269f0607172121i1b6610b1j3fc686d4132f880b@mail.gmail.com
Lists: pgsql-performance

On 7/17/06, Ron Peacetree <rjpeace(at)earthlink(dot)net> wrote:
>
> -----Original Message-----
> >From: Mikael Carneholm <Mikael(dot)Carneholm(at)WirelessCar(dot)com>
> >Sent: Jul 17, 2006 5:16 PM
> >To: Ron Peacetree <rjpeace(at)earthlink(dot)net>,
> pgsql-performance(at)postgresql(dot)org
> >Subject: RE: [PERFORM] RAID stripe size question
> >
> >>15Krpm HDs will have average access times of 5-6ms. 10Krpm ones of
> 7-8ms.
> >
> >Average seek time for that disk is listed as 4.9ms, maybe sounds a bit
> optimistic?
> >
> Ah, the games vendors play. "average seek time" for a 10Krpm HD may very
> well be 4.9ms. However, what matters to you the user is "average =access=
> time". The 1st is how long it takes to position the heads to the correct
> track. The 2nd is how long it takes to actually find and get data from a
> specified HD sector.
>
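
To put rough numbers on that difference: average access time is roughly
average seek time plus average rotational latency (half a revolution). A
quick sketch, with seek figures that are purely illustrative:

  # Back-of-the-envelope: average access time ~= average seek time plus
  # average rotational latency (half a revolution). Seek figures here are
  # illustrative, not taken from any particular drive's spec sheet.
  def avg_access_ms(seek_ms, rpm):
      rotational_latency_ms = 0.5 * 60000.0 / rpm  # half a rev, in ms
      return seek_ms + rotational_latency_ms

  print(avg_access_ms(4.9, 10000))  # ~7.9ms for a 10Krpm drive
  print(avg_access_ms(3.8, 15000))  # ~5.8ms for a 15Krpm drive
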
> >> 28HDs as above setup as 2 RAID 10's => ~75MBps*5= ~375MB/s, ~75*9=
> ~675MB/s.
> >
> >I guess it's still limited by the 2Gbit FC (192MB/s), right?
> >
> No. A decent HBA has multiple IO channels on it. So for instance Areca's
> ARC-6080 (8/12/16-port 4Gbps Fibre-to-SATA II Controller) has 2 4Gbps FCs in
> it (...and can support up to 4GB of BB cache!). Nominally, this card can
> push 8Gbps = 800MBps. ~600-700MBps is the real-world number.
>
> Assuming ~75MBps ASTR per HD, that's ~ enough bandwidth for a 16 HD RAID
> 10 set per ARC-6080.
>
> >>Very, very few RAID controllers can do >= 1GBps. One thing that helps
> >>greatly with bursty IO patterns is to up your battery backed RAID cache
> >>as high as you possibly can. Even multiple GBs of BBC can be worth it.
> >>Another reason to have multiple controllers ;-)
> >
> >I use 90% of the raid cache for writes, don't think I could go higher
> >than that. Too bad the emulex only has 256MB though :/
> >
> If your RAID cache hit rates are in the 90+% range, you probably would
> find it profitable to make it greater. I've definitely seen access patterns
> that benefitted from increased RAID cache for any size I could actually
> install. For those access patterns, no amount of RAID cache commercially
> available was enough to find the "flattening" point of the cache percentage
> curve. 256MB of BB RAID cache per HBA is just not that much for many IO
> patterns.

90% as in 90% of the RAM, not a 90% hit rate, I'm imagining.

>The controller is a FC2143 (
> http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=config&ProductLineId=450&FamilyId=1449&BaseId=17621&oi=E9CED&BEID=19701&SBLID=)
> which uses PCI-E. Don't know how it compares to other controllers, haven't
> had the time to search for / read any reviews yet.
> >
> This is a relatively low end HBA with 1 4Gb FC on it. Max sustained IO on
> it is going to be ~320MBps. Or ~ enough for an 8 HD RAID 10 set made of
> 75MBps ASTR HD's.
>
> 28 such HDs are =definitely= IO choked on this HBA.

No, they aren't. This is OLTP, not data warehousing. I already posted the
math for OLTP throughput, which is on the order of 8-80MB/second of actual
data throughput based on maximum theoretical seeks/second.
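
To restate that estimate as a sketch (the per-drive IOPS bounds below are
assumptions, not measurements from your array):

  # Sketch of seek-limited OLTP throughput: assumes every IO is a random
  # 8KB Postgres page; per-drive IOPS figures are rough assumptions.
  def oltp_mb_per_sec(drives, iops_per_drive, page_kb=8):
      return drives * iops_per_drive * page_kb / 1024.0

  print(oltp_mb_per_sec(28, 40))   # ~8.8 MB/s at a pessimistic 40 IOPS/drive
  print(oltp_mb_per_sec(28, 330))  # ~72 MB/s at an optimistic 330 IOPS/drive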

> The arithmetic suggests you need a better HBA or more HBAs or both.
>
>
> >>WAL's are basically appends that are written in bursts of your chosen
> >>log chunk size and that are almost never read afterwards. Big DB pages
> >>and big RAID stripes make sense for WALs.

Unless, of course, you are running OLTP, in which case a big stripe isn't
necessary; spend the disks on your data partition instead, because your WAL
activity is going to be small compared with your random IO.

>
> >According to
> http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it
> seems to be the other way around? ("As stripe size is decreased, files are
> broken into smaller and smaller pieces. This increases the number of drives
> that an average file will use to hold all the blocks containing the data of
> that file, theoretically increasing transfer performance, but decreasing
> positioning performance.")
> >
> >I guess I'll have to find out which theory holds by good ol' trial and
> >error... :)
> >
> IME, stripe sizes of 64KB, 128KB, or 256KB are the ones most commonly
> found to be optimal for most access patterns + SW + FS + OS + HW.

New records will be posted at the end of a file, and will only increase the
file by the number of blocks in the transactions posted at write time.
Updated records are modified in place unless they have grown too big to fit
in place. If you are updating multiple tables on each transaction, a 64kb
stripe size or lower is probably going to be best, as block sizes are just
8kb. How much data does your average transaction write? How many xacts per
second? This will help determine how many writes your cache will queue up
before it flushes, and therefore what the optimal stripe size will be. Of
course, the fastest and most accurate way is probably just to try different
settings and see how it works. Alas, some controllers seem to handle some
stripe sizes more efficiently in defiance of any logic.
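
To put rough numbers on the "how many writes will the cache queue up"
question (everything plugged in below is a hypothetical example, not a
measurement from your system):

  # Sketch: how much dirty data the controller cache might accumulate
  # between flushes - one input into picking a stripe size. All inputs
  # below are hypothetical examples.
  def dirty_kb_per_flush(xacts_per_sec, kb_per_xact, flush_interval_sec):
      return xacts_per_sec * kb_per_xact * flush_interval_sec

  # e.g. 500 xacts/s, each dirtying two 8KB blocks, cache flushing every 0.1s
  print(dirty_kb_per_flush(500, 16, 0.1))  # 800KB queued per flush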

Work out how big your xacts are and how many xacts/second you can post, and
you will figure out how fast WAL will be written. Allocate enough disk for
peak load plus planned expansion to WAL, and then put the rest toward
tablespace. You may well find that a single RAID 1 is enough for WAL (if
you achieve theoretical performance levels, which it's clear your controller
isn't).

For example, your bonnie++ benchmark shows 538 seeks/second. If on each seek
one writes 8k of data (one block), then your total throughput to disk is
538*8k = 4304k, which is just over 4MB/second of actual throughput for WAL -
about what I estimated in my calculations earlier. A single RAID 1 will
easily suffice to handle WAL for this kind of OLTP xact rate. Even if you
write a full stripe on every pass at 64kb, that's still only 538*64k =
34432k, or around 34MB/s - still within the capability of a correctly
running RAID 1, and even with your low bonnie scores, within the capability
of your 4 disk RAID 10.
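
The same arithmetic as a quick sketch, in case you want to plug in your own
bonnie++ numbers:

  # Seek-bound WAL write rate from a measured seeks/second figure.
  # The 538 seeks/s is the bonnie++ number discussed above.
  def wal_mb_per_sec(seeks_per_sec, kb_written_per_seek):
      return seeks_per_sec * kb_written_per_seek / 1024.0

  print(wal_mb_per_sec(538, 8))   # ~4.2 MB/s at one 8KB block per seek
  print(wal_mb_per_sec(538, 64))  # ~33.6 MB/s at a full 64KB stripe per seek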

Remember, when it comes to OLTP, massive serial throughput is not going to
help you; what matters is low seek time, which is why people still buy 15k
RPM drives, and why you don't necessarily need a honking SAS/SATA controller
that can harness the full 1066MB/sec of your PCI-X bus, or more for PCIe. Of
course, once you have a bunch of OLTP data, people will inevitably want
reports on that stuff, and what was mainly an OLTP database suddenly becomes
a data warehouse in a matter of months, so don't neglect to consider that
problem as well.

Also, more RAM on the RAID card will seriously help bolster your transaction
rate, as your controller can queue up a whole bunch of table writes and
burst them all at once in a single seek, which will increase your overall
throughput by as much as an order of magnitude (and you would therefore have
to increase WAL accordingly).

But finally - if your card/cab isn't performing, RMA it. Send the damn thing
back and get something that actually can do what it should. Don't tolerate
manufacturers' BS!!

Alex
