Re: Scaling further up

From: "Aaron Werman" <awerman2(at)hotmail(dot)com>
To: "Marty Scholes" <marty(at)outputservices(dot)com>, <pgsql-performance(at)postgresql(dot)org>
Subject: Re: Scaling further up
Date: 2004-03-14 22:11:22
Message-ID: Law10-OE17LcVq0N1ml000479de@hotmail.com
Lists: pgsql-performance

Sorry about not chiming in before - I've been too swamped to think. I agree
with most of the points, but a lot of these posts are interesting in that they
describe systems from an SA perspective, in contrast to my DBA-centric view.

----- Original Message -----
From: "Marty Scholes" <marty(at)outputservices(dot)com>
To: <pgsql-performance(at)postgresql(dot)org>
Sent: Wednesday, March 10, 2004 6:29 PM
Subject: Re: [PERFORM] Scaling further up

> I have some suggestions based on my anecdotal experience.
>
> 1. This is a relatively small DB -- the working set will likely be in
> RAM at any moment in time, making read I/O time mostly irrelevant.
>
> 2. The killer will be write times -- specifically log writes. Small and
> heavily synchronized writes, log and data writes, will drag down an
> impressive hardware RAID setup. We run mirrored hardware RAID 5 arrays
> with write back cache and are constantly seeking ways to improve write
> performance. We do a lot of batch processing, though, so we do a lot of
> write I/Os.

My experience with RAID 5 for streaming sequential writes is bad. This is
sometimes helped by hardware caching covering the cost of the additional
parity I/Os (write-back RAID 5 + a big cache acts like RAID 1+0 until you run
out of cache). Batch processing is different from high-concurrency
transaction processing: batch needs fast volume streaming, while TP is
dependent on the speed of ack'ing (a few big writes with fewer synchronous
waits vs. lots of small writes which serialize everyone). (RAID 3 worked for
me in the past for logging, but I haven't used it in years.)
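
To make the contrast concrete, here is a rough Python sketch of the two write
patterns - many small fsync'd writes (TP-style) vs. a few big buffered writes
(batch-style). The file name, block sizes, and counts are arbitrary
assumptions of mine; absolute numbers depend entirely on the disk and any
cache in the path.

import os
import time

def timed_writes(path, block_size, count, sync_each):
    """Write `count` blocks of `block_size` bytes; optionally fsync each one."""
    buf = b"\0" * block_size
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    start = time.time()
    for _ in range(count):
        os.write(fd, buf)
        if sync_each:
            os.fsync(fd)      # wait for the platters (or the cache) to ack
    os.fsync(fd)              # make the buffered case pay for durability once
    os.close(fd)
    elapsed = time.time() - start
    os.unlink(path)
    mb = block_size * count / 1e6
    print(f"{count:6d} x {block_size:7d}B sync={sync_each!s:5}: "
          f"{elapsed:6.2f}s  {mb / elapsed:7.1f} MB/s")

# Same total volume (~64 MB), two access patterns: TP-like vs. batch-like.
timed_writes("bench.dat", 8 * 1024, 8192, sync_each=True)    # serialized
timed_writes("bench.dat", 1024 * 1024, 64, sync_each=False)  # streamed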

>
> 3. Be very careful with "battery backed write cache." It usually works
> as advertised. More than once in the past decade I have seen
> spontaneous cache corruption after power loss. The corruption usually
> happens when some admin, including me, has assumed that the cache will
> ALWAYS survive a power failure unblemished and has no "plan B." Make
> sure you have a contingency plan for corruption, or don't enable the
> cache.

I agree strongly. The same problem exists with the disks' own write-back
caches, even when write-through is enabled on the SCSI controller. PITR would
help here. A lot of these problems are due to procedural error after a crash.

>
> 4. RAID 10 will likely have bigger stripe sizes on the RAID 0 portion of
> the setup, and might hinder, not help small write I/O performance.

In a high-volume system without write caching you are almost always going to
see queuing, which makes a larger stripe size mostly irrelevant unless it is
huge. Write caching thrives on big block sizes (a key reason why Symmetrix
doesn't do worse than it does) by reducing I/O counts. Most shops I've set up
or seen use mirroring or RAID 10 for logs. Note also that many RAID 10
controllers in a non-write-cached setup allow a race between the two mirror
writers, acknowledging when the first of the two completes - increasing
throughput by about 1/4.
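
A back-of-the-envelope model of that race, assuming (my assumption, purely
illustrative) that each mirror adds an independent rotational delay uniform
over one revolution:

import random

T_REV_MS = 8.33          # one revolution at 7200 RPM
TRIALS = 1_000_000

wait_single = 0.0        # latency when you must wait for one given writer
wait_first = 0.0         # latency when the first of two writers acks
for _ in range(TRIALS):
    a = random.uniform(0.0, T_REV_MS)
    b = random.uniform(0.0, T_REV_MS)
    wait_single += a
    wait_first += min(a, b)

print(f"single writer: {wait_single / TRIALS:.2f} ms")  # ~4.17 ms (T/2)
print(f"first of two : {wait_first / TRIALS:.2f} ms")   # ~2.78 ms (T/3)

In this toy model the average rotational wait drops from T/2 to T/3; once
seek time and queuing are layered back on, something like the 1/4 overall
gain above is plausible.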

>
> 5. Most (almost all) of the I/O time will be due to the access time
> (head seek + head settle + rotational latency) and very little of the
> I/O time will be due to data transfer time. In other words, getting drives
> that provide faster transfer rates will barely improve performance. The
> secret is lowering the access time.

True. This is very much a latency story. Even in volume batch work, the
access times you observe often point to some other system configuration
bottleneck that is hit before I/O capacity is.
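
The arithmetic is easy to check. Using nominal figures I am assuming for a
commodity 7200 RPM drive and an 8 KB page:

seek_ms = 8.5                      # average seek + head settle
rotation_ms = 0.5 * 60_000 / 7200  # average rotational latency = 4.17 ms
xfer_rate_mb = 60.0                # sustained media transfer rate, MB/s
block_kb = 8                       # one PostgreSQL-sized page

transfer_ms = block_kb / 1024 / xfer_rate_mb * 1000
total_ms = seek_ms + rotation_ms + transfer_ms
access_ms = seek_ms + rotation_ms
print(f"access  : {access_ms:5.2f} ms ({access_ms / total_ms:.1%})")
print(f"transfer: {transfer_ms:5.2f} ms ({transfer_ms / total_ms:.1%})")
# access  : 12.67 ms (99.0%) - doubling the transfer rate saves ~0.07 ms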

>
> 6. A relatively cheap way to drastically drop the access time is to get
> large drive(s) and only use a portion of them for storage. The less
> space used on the drive, the less area the heads need to cover for
> seeks. At one extreme, you could make the partition the size of a
> single cylinder. This would make access time (ignoring OS and
> controller overhead) identical to rotational latency, which is as low as
> 4.2 ms for a cheap 7200 RPM drive.

This is a good strategy for VLDB, and may not be relevant in this case.
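
For reference, the floor that approach converges to is half a revolution -
the 4.2 ms quoted above for a 7200 RPM drive:

# With a near-single-cylinder partition the seek term approaches zero,
# leaving only the average rotational latency of half a revolution.
for rpm in (7200, 10_000, 15_000):
    print(f"{rpm:>6} RPM: {0.5 * 60_000 / rpm:.2f} ms")
# 7200 RPM: 4.17 ms, 10000 RPM: 3.00 ms, 15000 RPM: 2.00 ms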

Also - big sequential writes on 15K RPM drives, in the write-through case,
are a beautiful thing - they look like a manufacturers' demo. A primary
performance role of an RDBMS is to convert random I/O to sequential (by
buffering reads and using a streaming log to defer random writes to
checkpoints). RDBMSs are the prime beneficiaries of drive speed improvements,
since logging, backups, and copies are about the only things (ignoring bad
perl scripts and find commands) that generate loads of 50+ MB/sec.
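
A toy sketch of that random-to-sequential conversion (my own illustration,
not PostgreSQL's actual WAL layout - all names here are invented):

import os
import struct

PAGE = 8192

class TinyWal:
    def __init__(self, log_path, data_path):
        self.log = open(log_path, "ab")
        self.data_fd = os.open(data_path, os.O_RDWR | os.O_CREAT, 0o644)
        self.dirty = {}                   # page number -> latest page image

    def write_page(self, pageno, image):
        """Commit path: one sequential append + fsync, no random I/O."""
        assert len(image) == PAGE
        self.log.write(struct.pack("<I", pageno) + image)
        self.log.flush()
        os.fsync(self.log.fileno())       # the only synchronous wait
        self.dirty[pageno] = image        # buffered until the checkpoint

    def checkpoint(self):
        """Background path: batch the random writes, sorted by offset."""
        for pageno in sorted(self.dirty): # sorted -> mostly sequential sweep
            os.pwrite(self.data_fd, self.dirty[pageno], pageno * PAGE)
        os.fsync(self.data_fd)
        self.dirty.clear()
        self.log.truncate(0)              # log space can now be recycled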

/Aaron
