Re: How to improve db performance with $7K?

From: Alan Stange <stange(at)rentec(dot)com>
To: PFC <lists(at)boutiquenumerique(dot)com>
Cc: Kevin Brown <kevin(at)sysexperts(dot)com>, pgsql-performance(at)postgresql(dot)org
Subject: Re: How to improve db performance with $7K?
Date: 2005-04-15 13:11:48
Message-ID: 425FBD94.3060600@rentec.com
Lists: pgsql-performance

PFC wrote:

>
>
>> My argument is that a sufficiently smart kernel scheduler *should*
>> yield performance results that are reasonably close to what you can
>> get with that feature. Perhaps not quite as good, but reasonably
>> close. It shouldn't be an orders-of-magnitude type difference.
>
>
> And a controller card (or drive) has a lot less RAM to use as a
> cache / queue for reordering requests than the OS has; the OS can
> potentially use most of the available RAM, which can be gigabytes
> on a big server, whereas a drive has at most a few tens of
> megabytes...
>
> However, all this is looking at the problem from the wrong end.
> The OS should provide a multi-read call that lets applications pass
> a list of blocks they'll need; it could then reorder them and read
> them in the fastest possible way, clustering them with similar
> requests from other threads.
>
> Right now, when a thread/process issues a read() it blocks until
> the block is delivered to that thread. The OS does not know whether
> the thread will then need the next block (which can be had very
> cheaply if you know ahead of time that you'll need it) or not. Thus
> it must make guesses, read ahead (sometimes), etc...

All true, which is why high-performance computing folks use
aio_read()/aio_write() and load up the kernel with all the requests they
expect to make.
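
Roughly, the pattern looks something like this (an untested sketch of
mine, not anything from earlier in the thread; the file name, block
size, and request count are just placeholders):

/* Queue a batch of reads with POSIX AIO so the kernel can schedule
 * them however it likes, then collect the results.
 * Typically built with:  cc -o aiodemo aiodemo.c -lrt            */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define NREQ  16
#define BLKSZ 8192

int main(void)
{
    int fd = open("datafile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct aiocb cbs[NREQ];
    const struct aiocb *list[NREQ];
    static char bufs[NREQ][BLKSZ];

    /* Load the kernel up with all the requests at once. */
    for (int i = 0; i < NREQ; i++) {
        memset(&cbs[i], 0, sizeof(cbs[i]));
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf    = bufs[i];
        cbs[i].aio_nbytes = BLKSZ;
        cbs[i].aio_offset = (off_t)i * BLKSZ;  /* could be any scattered offsets */
        if (aio_read(&cbs[i]) != 0) { perror("aio_read"); return 1; }
        list[i] = &cbs[i];
    }

    /* Wait for completion; the kernel/driver is free to reorder the I/O. */
    for (int i = 0; i < NREQ; i++) {
        while (aio_error(&cbs[i]) == EINPROGRESS)
            aio_suspend(list, NREQ, NULL);
        printf("request %d returned %zd bytes\n", i, aio_return(&cbs[i]));
    }

    close(fd);
    return 0;
}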

The kernels that I'm familiar with will do read-ahead on files based on
some heuristics: when you read the first byte of a file, the OS will
typically load several pages of the file (depending on file size,
etc.). If you keep issuing read() calls without an lseek() on the file
descriptor, the kernel takes the hint that you're doing a sequential
read and continues caching pages ahead of time, usually reusing the
pages you just read to hold the new data, so memory isn't bloated with
data that won't be needed again. Throw in an lseek() and the amount of
read-ahead caching may be reduced.
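
Where the heuristic isn't enough, an application can also state its
intentions explicitly with posix_fadvise() on systems that implement
it. Something along these lines (again an untested sketch of mine, file
name and range made up):

/* Hint the kernel about our access pattern instead of relying on its
 * read-ahead guesswork. */
#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("datafile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Declare a sequential access pattern for the whole file so the
     * kernel can read ahead more aggressively than its default. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* Or ask for a specific range to be pulled into the page cache
     * before we actually read() it. */
    posix_fadvise(fd, 0, 1 << 20, POSIX_FADV_WILLNEED);

    /* ... normal read() loop would go here ... */

    close(fd);
    return 0;
}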

One point that is being missed in all this discussion is that the file
system also imposes constraints on how I/Os can be done. For example,
simply doing a write(fd, buf, 100000000) doesn't emit a stream of
sequential blocks to the drives. Some file systems (UFS was one) would
force portions of large files into other cylinder groups so that small
files could be located near their inode data, thus avoiding or reducing
seeks. Similarly, extents need to be allocated, and the bitmaps
recording them usually need synchronous updates, which require some
seeks, not to mention the need to update the inode data itself. Anyway,
my point is that the allocation policies of the file system can confuse
the situation.

Also, the seek times one sees reported are an average. One really needs
to look at the track-to-track seek time and also the "full stroke" seek
time. It takes a *long* time to move the heads across the whole
platter. I've seen people partition drives to use only a small region
of each drive, both to avoid long seeks and to make better use of the
increased number of bits passing under the head in one rotation. A 15K
RPM drive doesn't necessarily have a faster seek time than a 10K drive
just because its rotational speed is higher; the average seek time might
be faster simply because the 15K drives have smaller platters with fewer
cylinders.
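
Just to put rough numbers on the rotational part (my back-of-the-
envelope arithmetic, not from any datasheet):

/* Average rotational latency is half a revolution and depends only on
 * RPM; the seek component is separate and depends on how far the heads
 * must travel (track-to-track vs. full stroke). */
#include <stdio.h>

int main(void)
{
    int rpms[] = { 10000, 15000 };
    for (int i = 0; i < 2; i++) {
        double ms_per_rev = 60.0 * 1000.0 / rpms[i];
        printf("%5d RPM: %.1f ms per revolution, %.1f ms avg rotational latency\n",
               rpms[i], ms_per_rev, ms_per_rev / 2.0);
    }
    /* 10000 RPM -> 6.0 ms/rev, 3.0 ms average rotational latency
     * 15000 RPM -> 4.0 ms/rev, 2.0 ms average rotational latency */
    return 0;
}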

-- Alan
