From: Ron Peacetree <rjpeace(at)earthlink(dot)net>
To: Mikael Carneholm <Mikael(dot)Carneholm(at)wirelesscar(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: RAID stripe size question
Date: 2006-07-18 12:32:35
Message-ID: 12877326.1153225955854.JavaMail.root@elwamui-huard.atl.sa.earthlink.net
Lists: pgsql-performance
>From: Alex Turner <armtuk(at)gmail(dot)com>
>Sent: Jul 18, 2006 12:21 AM
>To: Ron Peacetree <rjpeace(at)earthlink(dot)net>
>Cc: Mikael Carneholm <Mikael(dot)Carneholm(at)wirelesscar(dot)com>, pgsql-performance(at)postgresql(dot)org
>Subject: Re: [PERFORM] RAID stripe size question
>
>On 7/17/06, Ron Peacetree <rjpeace(at)earthlink(dot)net> wrote:
>>
>> -----Original Message-----
>> >From: Mikael Carneholm <Mikael(dot)Carneholm(at)WirelessCar(dot)com>
>> >Sent: Jul 17, 2006 5:16 PM
>> >To: Ron Peacetree <rjpeace(at)earthlink(dot)net>,
>> pgsql-performance(at)postgresql(dot)org
>> >Subject: RE: [PERFORM] RAID stripe size question
>> >
>> >I use 90% of the raid cache for writes, don't think I could go higher
>> >than that.
>> >Too bad the emulex only has 256Mb though :/
>> >
>> If your RAID cache hit rates are in the 90+% range, you probably would
>> find it profitable to make it greater. I've definitely seen access patterns
>> that benefitted from increased RAID cache for any size I could actually
>> install. For those access patterns, no amount of RAID cache commercially
>> available was enough to find the "flattening" point of the cache percentage
>> curve. 256MB of BB RAID cache per HBA is just not that much for many IO
>> patterns.
>
>90% as in 90% of the RAM, not 90% hit rate I'm imagining.
>
Either way, =particularly= for OLTP-like I/O patterns, the more RAID cache the better unless the IO pattern is completely random, in which case the best you can do is cache the entire sector map of the RAID set and use as many spindles as possible for the tables involved. I've seen high-end setups in Fortune 2000 organizations that look like some of the things you read about on tpc.org: =hundreds= of HDs in use.
Clearly, completely random IO patterns are to be avoided whenever and however possible.
Thankfully, most things can be designed to avoid completely random IO, and things like WAL IO are definitely not random.
The important point about cache size is that unless the cache is large enough that you see a flattening in the cache hit-rate curve, you can probably still use more cache. Working sets for DB applications are often very large.
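To make the cache-size point concrete, here is a rough back-of-envelope sketch (a toy model, not a measurement). It assumes a uniform-random access pattern over a fixed working set, so the hit rate grows with cache size until the working set fits and then flattens; every number in it (working set size, latencies) is an illustrative assumption.

    # Toy model: effective access time vs RAID cache size.
    # Assumes uniform-random access over a fixed working set, so
    # hit_rate ~= cache_size / working_set (capped at 1.0).
    # All figures are illustrative assumptions, not measurements.
    working_set_gb = 32.0     # hypothetical hot portion of the DB
    cache_hit_us = 50.0       # hypothetical cache-hit service time
    disk_miss_us = 8000.0     # hypothetical random HD access (~8 ms)

    for cache_gb in (0.25, 0.5, 1, 2, 4, 8, 16, 32, 64):
        hit = min(1.0, cache_gb / working_set_gb)
        eff = hit * cache_hit_us + (1.0 - hit) * disk_miss_us
        print("%6.2f GB cache: hit %3.0f%%  effective access %7.1f us"
              % (cache_gb, hit * 100, eff))
    # Adding cache keeps helping until the working set fits (32GB here);
    # past that point the curve flattens and more cache buys very little.

Until you can see that flattening on your own workload, more cache is probably still worth having.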
>>The controller is a FC2143 (
>> http://h71016.www7.hp.com/dstore/MiddleFrame.asp?page=config&ProductLineId=450&FamilyId=1449&BaseId=17621&oi=E9CED&BEID=19701&SBLID=)
>> which uses PCI-E. Don't know how it compares to other controllers, haven't
>> had the time to search for / read any reviews yet.
>> >
>> This is a relatively low end HBA with 1 4Gb FC on it. Max sustained IO on
>> it is going to be ~320MBps. Or ~ enough for an 8 HD RAID 10 set made of
>> 75MBps ASTR HD's.
>>
>> 28 such HDs are =definitely= IO choked on this HBA.
>
>No, they aren't. This is OLTP, not data warehousing. I already posted the math
>for OLTP throughput, which is on the order of 8-80MB/second actual data
>throughput based on maximum theoretical seeks/second.
>
WAL IO patterns are not OLTP-like. Neither are most reporting or decision-support IO patterns. Even in an OLTP system, there are usually only a few scenarios and tables where the IO pattern is pessimal (close to completely random).
Alex is quite correct that those few will be the bottleneck on overall system performance if the system's primary function is OLTP-like.
For those few, you dedicate as many spindles and as much RAID cache as you can afford, up to the point where they stop showing a performance benefit. I've seen an entire HBA, maxed out with cache and with as many HDs as would saturate the attainable IO rate, dedicated to =1= table (unfortunately SSD was not a viable option in that case).
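To put some rough numbers on the sequential-vs-random disagreement above, here's a hedged sketch in Python. The per-drive figures (75MBps sequential ASTR, ~130 random accesses/sec, 8KB per random IO) and the ~320MBps HBA figure are assumptions taken for illustration, not specs.

    # Back-of-envelope: how many drives saturate one ~320MBps HBA,
    # depending on access pattern. All per-drive figures are assumptions.
    hba_mbps = 320.0          # assumed sustained rate of one 4Gb FC HBA
    seq_mbps_per_hd = 75.0    # assumed average sustained transfer rate (ASTR)
    rand_iops_per_hd = 130.0  # assumed random accesses/sec per HD
    io_kb = 8.0               # one 8KB pg block per random access

    rand_mbps_per_hd = rand_iops_per_hd * io_kb / 1024.0
    print("Drives to saturate HBA, sequential: %.1f"
          % (hba_mbps / seq_mbps_per_hd))
    print("Drives to saturate HBA, random 8KB: %.1f"
          % (hba_mbps / rand_mbps_per_hd))
    # Roughly 4-5 drives saturate the link on big sequential scans, while
    # hundreds would be needed for purely random 8KB IO -- which is why
    # whether 28 HDs choke this HBA depends on how sequential the hot
    # tables' IO actually is.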
>>The arithmetic suggests you need a better HBA or more HBAs or both.
>>
>>
>> >>WALs are basically appends that are written in bursts of your chosen
>> log chunk size and that are almost never read afterwards. Big DB pages and
>> big RAID stripes make sense for WALs.
>
>
>unless of course you are running OLTP, in which case a big stripe isn't
>necessary, spend the disks on your data partition, because your WAL activity
>is going to be small compared with your random IO.
>
Or to put it another way, the scenarios and tables with the most random-looking IO patterns are going to be the performance bottleneck for the whole system. In an OLTP-like system, WAL IO is unlikely to be your biggest performance issue. As in any other performance tuning effort, you only gain by speeding up the current bottleneck.
>>
>> >According to
>> http://www.pcguide.com/ref/hdd/perf/raid/concepts/perfStripe-c.html, it
>> seems to be the other way around? ("As stripe size is decreased, files are
>> broken into smaller and smaller pieces. This increases the number of drives
>> that an average file will use to hold all the blocks containing the data of
>> that file, theoretically increasing transfer performance, but decreasing
>> positioning performance.")
>> >
>> >I guess I'll have to find out which theory holds by good ol' trial
>> and error... :)
>> >
>> IME, stripe sizes of 64, 128, or 256 are the most common found to be
>> optimal for most access patterns + SW + FS + OS + HW.
>
>
>New records will be posted at the end of a file, and will only increase the
>file by the number of blocks in the transactions posted at write time.
>Updated records are modified in place unless they have grown too big to be
>in place. If you are updated mutiple tables on each transaction, a 64kb
>stripe size or lower is probably going to be best as block sizes are just
>8kb.
>
Here's where Theory and Practice conflict. pg does not "update" by modifying in place in the true DB sense. A pg UPDATE is actually an insert of a new row or rows, !not! a modify in place.
I'm sure Alex knows this and just temporarily forgot some of the context of this thread :-)
The append behavior Alex refers to is the best-case scenario for pg, where a) the table is unfragmented and b) the file segment (of, say, 2GB) holding that part of the pg table is not yet full.
VACUUM and autovacuum are your friends.
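As a quick illustration of that point, here's a minimal sketch (assuming a reachable PostgreSQL instance, the psycopg2 driver, and a hypothetical "test" database) that watches a row's physical location change across an UPDATE:

    # Sketch: a PostgreSQL UPDATE writes a new row version rather than
    # modifying the old one in place, so the tuple's physical address
    # (ctid) changes. Connection string and table name are hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=test")
    cur = conn.cursor()
    cur.execute("CREATE TEMP TABLE upd_demo (id int PRIMARY KEY, val text)")
    cur.execute("INSERT INTO upd_demo VALUES (1, 'before')")
    cur.execute("SELECT ctid FROM upd_demo WHERE id = 1")
    print("ctid before UPDATE:", cur.fetchone()[0])
    cur.execute("UPDATE upd_demo SET val = 'after' WHERE id = 1")
    cur.execute("SELECT ctid FROM upd_demo WHERE id = 1")
    print("ctid after UPDATE: ", cur.fetchone()[0])  # new location = new row version
    conn.rollback()  # discard the demo
    conn.close()
    # The old version lingers until VACUUM reclaims it, which is why
    # VACUUM/autovacuum matter for keeping tables compact.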
>How much data does your average transaction write? How many xacts per
>second, this will help determine how many writes your cache will queue up
>before it flushes, and therefore what the optimal stripe size will be. Of
>course, the fastest and most accurate way is probably just to try different
>settings and see how it works. Alas some controllers seem to handle some
>stripe sizes more efficiently in defiance of any logic.
>
>Work out how big your xacts are, how many xacts/second you can post, and you
>will figure out how fast WAL will be written. Allocate enough disk for
>peak load plus planned expansion on WAL and then put the rest to
>tablespace. You may well find that a single RAID 1 is enough for WAL (if
>you achieve theoretical performance levels, which it's clear your controller
>isn't).
>
This is very good advice.
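A hedged sketch of the kind of arithmetic Alex describes follows; every input is a placeholder to be replaced with your own measured values.

    # Back-of-envelope WAL sizing from transaction rate.
    # All inputs are hypothetical placeholders -- substitute measurements.
    xacts_per_sec = 500          # assumed sustained commit rate
    wal_bytes_per_xact = 4096    # assumed average WAL written per xact
    peak_factor = 3.0            # assumed peak-to-average headroom

    avg_mbps = xacts_per_sec * wal_bytes_per_xact / (1024.0 * 1024.0)
    peak_mbps = avg_mbps * peak_factor
    print("Average WAL write rate: %.1f MB/s" % avg_mbps)
    print("Planned peak WAL rate:  %.1f MB/s" % peak_mbps)
    # If the peak rate fits comfortably within what one RAID 1 pair can
    # stream sequentially (often tens of MB/s), a dedicated RAID 1 for
    # pg_xlog suffices and the remaining spindles can go to the tablespaces.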
>For example, your bonnie++ benchmark shows 538 seeks/second. If on each seek
>one writes 8k of data (one block) then your total throughput to disk is
>538*8k=4304k which is just 4MB/second actual throughput for WAL, which is
>about what I estimated in my calculations earlier. A single RAID 1 will
>easily suffice to handle WAL for this kind of OLTP xact rate. Even if you
>write a full stripe on every pass at 64kb, that's still only 538*64k = 34432k
>or around 34Meg, still within the capability of a correctly running RAID 1,
>and even with your low bonnie scores, within the capability of your 4 disk
>RAID 10.
>
I'd also suggest that you figure out what the maximum accesses per second your HDs can deliver and make sure you are attaining it, since that will set the ceiling on your overall system performance.
Like I've said, I've seen organizations dedicate, on a per-table basis, as much HW as could make any difference for their important OLTP systems.
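For instance, here's a rough way to sanity-check whether you're getting the access rate your spindles should deliver. The seek and rotational figures are generic assumptions for a 15K RPM drive, not specs for any particular model, and the spindle count is hypothetical.

    # Rough ceiling on random accesses/sec per spindle and for a set.
    # Seek/rotation figures are generic 15K RPM assumptions.
    avg_seek_ms = 3.5                   # assumed average seek time
    rpm = 15000.0
    avg_rot_ms = 0.5 * 60000.0 / rpm    # half a rotation on average (~2 ms)
    per_hd_access = 1000.0 / (avg_seek_ms + avg_rot_ms)

    spindles = 28                       # hypothetical spindle count
    print("Per-drive ceiling: ~%.0f accesses/sec" % per_hd_access)
    print("Aggregate ceiling: ~%.0f accesses/sec" % (per_hd_access * spindles))
    # Compare the aggregate figure against what bonnie++ or the live
    # workload actually achieves; a large shortfall points at the HBA,
    # cache settings, or layout rather than at the drives themselves.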
>Remember when it comes to OLTP, massive serial throughput is not gonna help
>you, it's low seek times, which is why people still buy 15k RPM drives, and
>why you don't necessarily need a honking SAS/SATA controller which can
>harness the full 1066MB/sec of your PCI-X bus, or more for PCIe. Of course,
>once you have a bunch of OLTP data, people will inevitably want reports on
>that stuff, and what was mainly an OLTP database suddenly becomes a data
>warehouse in a matter of months, so don't neglect to consider that problem
>also.
>
One warning to expand on Alex's point here:
DO !NOT! use the same table schema and/or DB for your reporting and your OLTP.
You will end up with a DBMS that is good at neither reporting nor OLTP.
>Also more RAM on the RAID card will seriously help bolster your transaction
>rate, as your controller can queue up a whole bunch of table writes and
>burst them all at once in a single seek, which will increase your overall
>throughput by as much as an order of magnitude (and you would have to
>increase WAL accordingly therefore).
>
*nods*
>But finally - if your card/cab isn't performing, RMA it. Send the damn thing
>back and get something that actually can do what it should. Don't tolerate
>manufacturers' BS!!
>
On this Alex and I are in COMPLETE agreement.
Ron