Re: [PERFORM] Quad processor options - summary

From: James Thornton <james(at)jamesthornton(dot)com>
To: Bjoern Metzdorf <bm(at)turtle-entertainment(dot)de>
Cc: pgsql-performance(at)postgresql(dot)org, pgsql-admin(at)postgresql(dot)org
Subject: Re: [PERFORM] Quad processor options - summary
Date: 2004-05-13 22:50:45
Message-ID: 40A3FBC5.8090301@jamesthornton.com
Lists: pgsql-admin pgsql-performance

Bjoern Metzdorf wrote:

>> You might also consider configuring the Postgres data drives for a
>> RAID 10 SAME configuration as described in the Oracle paper "Optimal
>> Storage Configuration Made Easy"
>> (http://otn.oracle.com/deploy/availability/pdf/oow2000_same.pdf) Has
>> anyone delved into this before?
>
> Ok, if I understand it correctly the paper recommends the following:
>
> 1. Get many drives and stripe them into a RAID0 with a stripe width of
> 1MB. I am not quite sure if this stripe width is to be controlled at the
> application level (does postgres support this?) or if e.g. the "chunk
> size" of the linux software driver is meant. Normally a chunk size of
> 4KB is recommended, so 1MB sounds fairly large.
>
> 2. Mirror your RAID0 and get a RAID10.

Don't use RAID 0+1 -- use RAID 1+0 instead. Performance is the same, but
if a disk fails in a RAID 0+1 configuration, you are left with a plain
RAID 0 array and no redundancy. In a RAID 1+0 configuration, multiple
disks can fail, as long as no mirrored pair loses both of its disks.
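
To put rough numbers on that, here is a minimal Python sketch that
enumerates every possible combination of failed disks and counts how
often each layout survives. It is a simplified model, not a description
of any particular controller: it assumes that a RAID 0+1 controller
treats a whole stripe set as dead once one of its disks fails. The
function names and the disk numbering are made up for illustration.

```python
from itertools import combinations


def raid10_survives(failed, n_pairs):
    # RAID 1+0: disks 2k and 2k+1 form mirror pair k; the stripe spans the pairs.
    # The array survives while every pair still has at least one working disk.
    return all(not ({2 * k, 2 * k + 1} <= failed) for k in range(n_pairs))


def raid01_survives(failed, n_pairs):
    # RAID 0+1: disks 0..n-1 form stripe set A, disks n..2n-1 form stripe set B,
    # and A is mirrored onto B. Assumption: one failed disk takes its whole
    # stripe set offline, so the array needs one set with no failures at all.
    side_a = set(range(n_pairs))
    side_b = set(range(n_pairs, 2 * n_pairs))
    return not (failed & side_a) or not (failed & side_b)


def survival_fraction(survives, n_pairs, n_failed):
    # Fraction of all n_failed-disk failure combinations the array survives.
    disks = range(2 * n_pairs)
    cases = [set(c) for c in combinations(disks, n_failed)]
    return sum(survives(f, n_pairs) for f in cases) / len(cases)


if __name__ == "__main__":
    # Eight disks, two simultaneous failures.
    print(f"RAID 1+0: {survival_fraction(raid10_survives, 4, 2):.3f}")  # 0.857
    print(f"RAID 0+1: {survival_fraction(raid01_survives, 4, 2):.3f}")  # 0.429
```

Under those assumptions, with eight disks and two failures, RAID 1+0
survives 24 of the 28 possible cases and RAID 0+1 only 12 of them.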

A few weeks ago I called LSI asking about the Dell PERC4-Di card, which
is actually an LSI Megaraid 320-2. Dell's documentation said that its
support for RAID 10 was in the form of RAID-1 concatenated, but LSI said
that this is incorrect and that it supports RAID 10 proper.

> 3. Use primarily the fast, outer regions of your disks. In practice this
> might be achieved by putting only half of the disk (the outer half) into
> your stripe set. E.g. put only the outer 18GB of your 36GB disks into
> the stripe set.

You can still use the inner half of the drives, just relegate it to
less-frequently accessed data.

You also need to consider the filesystem.

SGI and IBM did a detailed study on Linux filesystem performance, which
included XFS, ext2, ext3 (various modes), ReiserFS, and JFS, and the
results are presented in a paper entitled "Filesystem Performance and
Scalability in Linux 2.4.17"
(http://oss.sgi.com/projects/xfs/papers/filesystem-perf-tm.pdf)

The scaling and load are key factors when selecting a filesystem. Since
Postgres data is stored in large files, ReiserFS is not the ideal choice
since it has been optimized for small files. XFS is probably the best
choice for a database server running on a quad processor box.

However, Dr. Bert Scalzo of Quest argues that general file system
benchmarks aren't ideal for benchmarking a filesystem for a database
server. In a paper entitled "Tuning an Oracle8i Database running Linux"
(http://otn.oracle.com/oramag/webcolumns/2002/techarticles/scalzo_linux02.html)
he says, "The trouble with these tests -- for example, Bonnie, Bonnie++,
Dbench, Iobench, Iozone, Mongo, and Postmark -- is that they are basic file
system throughput tests, so their results generally do not pertain in
any meaningful fashion to the way relational database systems access
data files." Instead he suggests using these two well-known and widely
accepted database benchmarks:

* AS3AP: a scalable, portable ANSI SQL relational database benchmark
that provides a comprehensive set of tests of database-processing power;
has built-in scalability and portability for testing a broad range of
systems; minimizes human effort in implementing and running benchmark
tests; and provides a uniform, metric, straightforward interpretation of
the results.

* TPC-C: an online transaction processing (OLTP) benchmark that involves
a mix of five concurrent transactions of various types and either
executes completely online or queries for deferred execution. The
database comprises nine types of tables, having a wide range of record
and population sizes. This benchmark measures the number of transactions
per second.

In the paper, Scalzo benchmarks ext2, ext3, ReiserFS, and JFS, but not
XFS. Surprisingly, ext3 won, but Scalzo didn't address scaling/load. The
results are surprising because most think ext3 is just ext2 with
journaling, thus having extra overhead from journaling.

If you read papers on ext3, you'll discover that it has some optimizations
that reduce disk head movement. For example, Daniel Robbins' "Advanced
filesystem implementor's guide, Part 7: Introducing ext3"
(http://www-106.ibm.com/developerworks/library/l-fs7/) says:

"The approach that the [ext3 Journaling Block Device layer API] uses is
called physical journaling, which means that the JBD uses complete
physical blocks as the underlying currency for implementing the
journal...the use of full blocks allows ext3 to perform some additional
optimizations, such as "squishing" multiple pending IO operations within
a single block into the same in-memory data structure. This, in turn,
allows ext3 to write these multiple changes to disk in a single write
operation, rather than many. In addition, because the literal block data
is stored in memory, little or no massaging of the in-memory data is
required before writing it to disk, greatly reducing CPU overhead."

I suspect that fewer writes may be the key factor in ext3 winning
Scalzo's DB benchmark. But as I said, Scalzo didn't benchmark XFS and he
didn't address scaling.
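
As a rough illustration of the "squishing" idea in the Robbins quote,
here is a toy Python sketch that merges several pending updates into one
in-memory buffer per block, so that each dirty block is written once.
This is not the ext3 JBD code; the 4096-byte block size, the function
name, and the (offset, data) update format are all assumptions made for
the example.

```python
BLOCK_SIZE = 4096  # assumed block size in bytes


def coalesce_writes(pending):
    """Merge pending (byte_offset, data) updates into one buffer per block.

    Returns a dict mapping block number to a full-block bytearray. Later
    updates overwrite earlier ones that touch the same bytes.
    """
    blocks = {}
    for offset, data in pending:
        block_no, start = divmod(offset, BLOCK_SIZE)
        if start + len(data) > BLOCK_SIZE:
            # Keep the sketch simple: an update may not cross a block boundary.
            raise ValueError("update spans more than one block")
        buf = blocks.setdefault(block_no, bytearray(BLOCK_SIZE))
        buf[start:start + len(data)] = data
    return blocks


if __name__ == "__main__":
    pending = [(0, b"aa"), (100, b"bb"), (4096, b"cc"), (10, b"dd")]
    dirty = coalesce_writes(pending)
    # Four pending updates collapse into two block-sized writes.
    print(len(pending), "updates ->", len(dirty), "block writes")
```

In this model the number of disk writes depends on how many distinct
blocks are dirty, not on how many individual updates were issued.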

XFS has a feature called delayed allocation that reduces IO
(http://www-106.ibm.com/developerworks/library/l-fs9/) and it scales
much better than ext3, so while I haven't tested it, I suspect that it
may be the ideal choice for large Linux DB servers:

"XFS handles allocation by breaking it into a two-step process. First,
when XFS receives new data to be written, it records the pending
transaction in RAM and simply reserves an appropriate amount of space on
the underlying filesystem. However, while XFS reserves space for the new
data, it doesn't decide what filesystem blocks will be used to store the
data, at least not yet. XFS procrastinates, delaying this decision to
the last possible moment, right before this data is actually written to
disk.

By delaying allocation, XFS gains many opportunities to optimize write
performance. When it comes time to write the data to disk, XFS can now
allocate free space intelligently, in a way that optimizes filesystem
performance. In particular, if a bunch of new data is being appended to
a single file, XFS can allocate a single, contiguous region on disk to
store this data. If XFS hadn't delayed its allocation decision, it may
have unknowingly written the data into multiple non-contiguous chunks,
reducing write performance significantly. But, because XFS delayed its
allocation decision, it was able to write the data in one fell swoop,
improving write performance as well as reducing overall filesystem
fragmentation.

Delayed allocation also has another performance benefit. In situations
where many short-lived temporary files are created, XFS may never need
to write these files to disk at all. Since no blocks are ever allocated,
there's no need to deallocate any blocks, and the underlying filesystem
metadata doesn't even get touched."
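
The contiguity argument in that quote can be sketched with a toy
allocator in Python. This is not how XFS is implemented; it assumes a
disk that hands out block numbers from a simple "next free" counter, and
the function names and the (file_name, n_blocks) append format are
invented for the example. It compares choosing blocks at write time with
only reserving a count and choosing blocks at flush time.

```python
def count_extents(blocks):
    """Number of contiguous runs in an ascending list of block numbers."""
    if not blocks:
        return 0
    return 1 + sum(1 for a, b in zip(blocks, blocks[1:]) if b != a + 1)


def allocate_immediately(appends):
    """Pick concrete blocks as soon as each append arrives."""
    next_free = 0
    layout = {}
    for name, n_blocks in appends:
        layout.setdefault(name, []).extend(range(next_free, next_free + n_blocks))
        next_free += n_blocks
    return layout


def allocate_delayed(appends):
    """Only reserve space per append; pick blocks once, at flush time."""
    reserved = {}
    for name, n_blocks in appends:
        reserved[name] = reserved.get(name, 0) + n_blocks
    next_free = 0
    layout = {}
    for name, n_blocks in reserved.items():
        # At flush time each file gets one contiguous region.
        layout[name] = list(range(next_free, next_free + n_blocks))
        next_free += n_blocks
    return layout


if __name__ == "__main__":
    # Two files being appended to in turn, one block at a time.
    appends = [("a", 1), ("b", 1)] * 4
    for allocate in (allocate_immediately, allocate_delayed):
        layout = allocate(appends)
        extents = {name: count_extents(blks) for name, blks in layout.items()}
        print(allocate.__name__, extents)
```

With two files appended to alternately, the immediate allocator leaves
each file in four separate extents, while the delayed one gives each
file a single extent.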

For further study, I have compiled a list of Linux filesystem resources
at: http://jamesthornton.com/hotlist/linux-filesystems/.

--

James Thornton
______________________________________________________
Internet Business Consultant, http://jamesthornton.com
