Re: SCSI vs. IDE performance test

From: "Rick Gigger" <rick(at)alpinenetworking(dot)com>
To: <pgsql-general(at)postgresql(dot)org>
Subject: Re: SCSI vs. IDE performance test
Date: 2003-10-28 00:49:55
Message-ID: 010e01c39ced$685c0870$0700a8c0@trogdor
Lists: pgsql-general

Thanks! Now it is much, much clearer. It leaves me with a few
additional questions, though.

Question 1:
"we have no portable means of expressing that exact constraint to the
kernel"
Does this mean that specific operating systems have a better way of dealing
with this? Which ones, and how? I'm guessing that it couldn't make too big
of a performance difference or it would probably be implemented already.

Question 2:
Do serial ATA drives suffer from the same issue?

----- Original Message -----
From: "Tom Lane" <tgl(at)sss(dot)pgh(dot)pa(dot)us>
To: "Rick Gigger" <rick(at)alpinenetworking(dot)com>
Cc: <pgsql-general(at)postgresql(dot)org>
Sent: Monday, October 27, 2003 5:05 PM
Subject: Re: [GENERAL] SCSI vs. IDE performance test

> "Rick Gigger" <rick(at)alpinenetworking(dot)com> writes:
> > ahhh. "lies about write order" is the phrase that I was looking for.
> > That seemed to make sense but I didn't know if I could go directly from
> > "lying about fsync" to that. Obviously I don't understand exactly what
> > fsync is doing.
>
> What we actually care about is write order: WAL entries have to hit the
> platter before the corresponding data-file changes do. Unfortunately we
> have no portable means of expressing that exact constraint to the
> kernel. We use fsync() (or related constructs) instead: issue the WAL
> writes, fsync the WAL file, then issue the data-file writes. This
> constrains the write ordering more than is really needed, but it's the
> best we can do in a portable Unix application.
>
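
To make the ordering above concrete, here is a minimal sketch in Python of
"issue the WAL writes, fsync the WAL file, then issue the data-file writes".
It is only an illustration, not PostgreSQL's actual code: the function names
commit_change and demo, the file names, and the record contents are all
made up for the example.

```python
import os
import tempfile


def commit_change(wal_path, data_path, wal_record, data_bytes):
    """Write a WAL record, fsync it, and only then write the data file."""
    # Step 1: append the WAL record and ask the kernel to force it to disk.
    wal_fd = os.open(wal_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(wal_fd, wal_record)  # small record; a real writer would loop on short writes
        os.fsync(wal_fd)  # returns once the kernel believes the WAL is on disk
    finally:
        os.close(wal_fd)

    # Step 2: only after fsync has returned do we issue the data-file write.
    data_fd = os.open(data_path, os.O_WRONLY | os.O_CREAT, 0o600)
    try:
        os.write(data_fd, data_bytes)
    finally:
        os.close(data_fd)


def demo():
    """Run one commit in a scratch directory and return both files' contents."""
    with tempfile.TemporaryDirectory() as scratch:
        wal_path = os.path.join(scratch, "wal")    # hypothetical file names
        data_path = os.path.join(scratch, "data")
        commit_change(wal_path, data_path, b"WAL: set x=1\n", b"x=1")
        with open(wal_path, "rb") as wal, open(data_path, "rb") as data:
            return wal.read(), data.read()
```

Note that fsync() here is stronger than the constraint we actually want: it
blocks until the WAL is reported written, rather than merely saying "write
this before that".
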
> The problem is that the kernel thinks fsync is done when the disk drive
> reports the writes are complete. When we say a drive lies about this,
> we mean it accepts a sector of data into its on-board RAM and then
> immediately claims write-complete, when in reality the data hasn't hit
> the platter yet and will be lost if power dies before the drive gets
> around to writing it.
>
> So we can have a scenario where we think WAL is down to disk and go
> ahead with issuing data-file writes. These will also be shoved over to
> the drive and stored in its on-board RAM. Now the drive has multiple
> sectors pending write in its buffers. If it chooses to write these in
> some order other than the order they were given to it, it could write
> the data file updates to disk first. If power drops *now*, we lose,
> because the data files are inconsistent and there's no WAL entry to tell
> us to fix it.
>
> Got it? It's really the combination of "lie about write completion" and
> "write pending sectors out of order" that can mess things up.
>
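
The failure scenario can be played out with a toy model. This is a sketch
under heavy assumptions: the Drive class, its methods, and the sector names
"wal" and "data" are invented for illustration and do not model any real
drive firmware. The point is only that "lie about completion" plus "write
out of order" leaves the data change on the platter with no WAL record.

```python
class Drive:
    """Toy drive: `platter` is durable, `cache` is volatile on-board RAM."""

    def __init__(self, lies):
        self.lies = lies
        self.platter = {}
        self.cache = []  # pending (sector, data) writes, in arrival order

    def write(self, sector, data):
        if self.lies:
            # Accept the data into RAM and claim completion immediately.
            self.cache.append((sector, data))
        else:
            # Report completion only once the data is on the platter.
            self.platter[sector] = data
        return "write-complete"

    def destage(self, sector):
        """Move one cached sector to the platter, in whatever order the drive likes."""
        for index, (cached_sector, data) in enumerate(self.cache):
            if cached_sector == sector:
                self.platter[cached_sector] = data
                del self.cache[index]
                return

    def power_loss(self):
        # Anything still in on-board RAM is gone.
        self.cache.clear()


def crash_scenario(lies):
    drive = Drive(lies)
    drive.write("wal", "WAL record for change X")  # the "fsync" returns here
    drive.write("data", "change X")                # so the data write is issued
    drive.destage("data")                          # drive picks the data sector first
    drive.power_loss()
    return drive.platter
```

With crash_scenario(True) the platter ends up holding only the "data"
sector; with crash_scenario(False) both sectors are there, because each
write was really done before it was reported complete.
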
> The reason IDE drives have to do this for reasonable performance is that
> the IDE interface is single-threaded: you can only have one read or
> write in process at a time, from the point of view of the
> kernel-to-drive interface. But in order to schedule reads and writes in
> a way that makes sense physically (minimizes seeks), the drive has to
> have multiple read and write requests pending that it can pick and
> choose from. The only possibility to do that in the IDE world is to
> let a write "complete" in interface terms before it's really done ...
> that is, lie.
>
> The reason SCSI drives do *not* do this is that the SCSI interface is
> logically multi-threaded: you can have multiple reads or writes pending
> at once. When you want to write on a SCSI drive, you send over a
> command that says "write this data at this sector". Sometime later the
> drive sends back a status report "yessir boss, I done did that write".
> Similarly, a read consists of a command "read this sector", followed
> sometime later by a response that delivers the requested data. But you
> can send other commands to read or write other sectors meanwhile, and
> the drive is free to reorder them to suit its convenience. So in the
> SCSI world, there is no need for the drive to lie in order to do its own
> read/write scheduling. The kernel knows the truth about whether a given
> sector has hit disk, and so it won't conclude that the WAL file has been
> completely fsync'd until it really is all down to the platter.
>
> This is also why SCSI disks shine on the read side when you have lots of
> processes doing reads: in an IDE drive, there is no way for the drive to
> satisfy read requests in any order but the one they're issued in. If the
> kernel guesses wrong about the best ordering for a set of read requests,
> then everybody waits for the seeks needed to get the earlier processes'
> data. A SCSI drive can fetch the "nearest" data first, and then that
> requester is freed to make progress in the CPU while the other guys wait
> for their longer seeks. There's no win here with a single active user
> process (since it probably wants specific data in a specific order), but
> it's a huge win if lots of processes are making unrelated read requests.
>
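
The read-side difference can be put in rough numbers. The sketch below
compares serving requests strictly in issue order (the IDE case) with a
simple nearest-first policy standing in for whatever reordering a SCSI
drive really does. The track numbers and function names are made up, and
seek cost is assumed to be plain track distance.

```python
def issue_order(head, requests):
    """IDE-style: requests are served in exactly the order they were issued."""
    return list(requests)


def nearest_first(head, requests):
    """Stand-in for drive-side reordering: always serve the closest track next."""
    pending = list(requests)
    order = []
    while pending:
        closest = min(pending, key=lambda track: abs(track - head))
        pending.remove(closest)
        order.append(closest)
        head = closest
    return order


def total_seek(head, order):
    """Total head travel, in tracks, to serve `order` starting at `head`."""
    distance = 0
    for track in order:
        distance += abs(track - head)
        head = track
    return distance


# Four unrelated processes each want one track; the head starts at track 50.
requests = [900, 60, 880, 70]
fifo_cost = total_seek(50, issue_order(50, requests))        # 3320 tracks
reordered_cost = total_seek(50, nearest_first(50, requests))  # 850 tracks
```

In this made-up case the processes waiting on tracks 60 and 70 get their
data after 10 and 20 tracks of travel instead of sitting behind the long
seek to track 900.
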
> Clear now?
>
> (In a previous lifetime I wrote SCSI disk driver code ...)
>
> regards, tom lane
>
