Re: SSD + RAID

From: Greg Smith <greg(at)2ndquadrant(dot)com>
To: Scott Carey <scott(at)richrelevance(dot)com>
Cc: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>, Laszlo Nagy <gandalf(at)shopzeus(dot)com>, Ivan Voras <ivoras(at)freebsd(dot)org>, "pgsql-performance(at)postgresql(dot)org" <pgsql-performance(at)postgresql(dot)org>
Subject: Re: SSD + RAID
Date: 2009-11-19 21:04:29
Message-ID: 4B05B2DD.2050201@2ndquadrant.com
Lists: pgsql-performance

Scott Carey wrote:
> Have PG wait a half second (configurable) after the checkpoint fsync()
> completes before deleting/ overwriting any WAL segments. This would be a
> trivial "feature" to add to a postgres release, I think. Actually, it
> already exists! Turn on log archiving, and have the script that it runs after a checkpoint sleep().
>
That won't help. Once the checkpoint is done, the problem isn't just
that the WAL segments are recycled; the server isn't going to use them
even if they were still there. The reason you can erase/recycle them
is that you're doing so *after* writing out a checkpoint record that
says you never have to look at them again. What you'd actually have
to do is hack the server code to insert that delay after every fsync;
there are none you can cheat on without introducing a corruption
possibility. The whole WAL/recovery mechanism in PostgreSQL makes few
assumptions about what the underlying disk actually has to do beyond
the fsync requirement; the flip side of that robustness is that the
fsync requirement is the one thing you can never violate safely.
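
To make that concrete, here's a minimal sketch (purely illustrative,
not actual PostgreSQL source) of what "insert that delay after every
fsync" would mean; the 0.5s figure is just the configurable value
suggested above:

    #include <errno.h>
    #include <unistd.h>

    /* Hypothetical delay giving the drive time to drain its small
     * write cache before we treat the fsync as having really
     * completed. */
    #define POST_FSYNC_DELAY_USEC 500000    /* 0.5 s */

    static int
    fsync_with_delay(int fd)
    {
        int rc;

        do {
            rc = fsync(fd);
        } while (rc < 0 && errno == EINTR);

        if (rc == 0)
            usleep(POST_FSYNC_DELAY_USEC);  /* hope the cache drained */

        return rc;
    }

Every durability-critical fsync in the server would have to go through
something like that; pausing only after checkpoints leaves all the
other paths exposed.
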
> BTW, the information I have seen indicates that the write cache is 256K on
> the Intel drives, the 32MB/64MB of other RAM is working memory for the drive
> block mapping / wear leveling algorithms (tracking 160GB of 4k blocks takes
> space).
>
Right. It's not used like the write cache on a regular hard drive,
where it's buffering 8MB-32MB worth of writes just to keep seek
overhead down. It's there primarily to allow combining writes into
large chunks, to better match the block size of the underlying SSD flash
cells (128K). Having enough space for two full cells allows spooling
out the flash write to a whole block while continuing to buffer the next
one.

This is why turning the cache off can tank performance so badly:
you're going to be writing a whole 128K block no matter what if it's
forced to disk without caching, even if all you needed was to write
an 8K page to it. That's only going to reach 1/16 of the usual write
speed on single-page writes. It's also why you should be concerned
about whether disabling the write cache impacts drive longevity; lots
of small writes going out in small chunks will wear the flash out
much faster than if the drive is allowed to wait until it's got a
full-sized block to write every time.
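
The arithmetic behind that 1/16 figure, spelled out (assuming the
128K flash block and 8K page sizes mentioned above):

    #include <stdio.h>

    int main(void)
    {
        const int flash_block_kb = 128;   /* SSD flash write block */
        const int page_kb = 8;            /* database page */

        /* Each 8K page forced out alone still costs a 128K flash
         * write, so throughput drops by the same factor. */
        int amplification = flash_block_kb / page_kb;   /* 16 */

        printf("%dx write amplification, ~1/%d of cached speed\n",
               amplification, amplification);
        return 0;
    }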

The fact that the cache is so small is also why it's harder to catch
the drive doing the wrong thing here. The plug test is pretty
sensitive to a problem when you've got megabytes worth of cached
writes spooling to disk at spinning hard drive speeds. The window
for loss on an SSD with no seek overhead and only a moderate number
of KB worth of cached data is much, much smaller. That doesn't mean
it's gone, though. It's a shame the design wasn't improved just a
little bit; a cheap capacitor and blocking new writes once the
incoming power drops is all it would take to make these much more
reliable for database use. But that would raise the price, and not
really help anybody but the small subset of the market that cares
about durable writes.
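
For anyone who hasn't run one, the plug test I mean is roughly this
sort of thing (a rough sketch, not any particular published tool):
write sequence-numbered blocks, fsync each one, report the last
number the OS claimed was durable, then pull the plug mid-run. Any
reported number missing from the file afterwards means the drive
lied about the fsync:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK 8192

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <test-file>\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }

        char buf[BLOCK];
        for (long seq = 0; ; seq++) {
            memset(buf, 0, sizeof(buf));
            snprintf(buf, sizeof(buf), "%ld", seq);

            if (write(fd, buf, sizeof(buf)) != (ssize_t) sizeof(buf)) {
                perror("write"); return 1;
            }
            if (fsync(fd) != 0) { perror("fsync"); return 1; }

            /* Only report once fsync has returned; a lying write
             * cache will lose blocks that were reported here. */
            printf("durable: %ld\n", seq);
            fflush(stdout);
        }
    }
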
> 4: Yet another solution: The drives DO adhere to write barriers properly.
> A filesystem that used these in the process of fsync() would be fine too.
> So XFS without LVM or MD (or the newer versions of those that don't ignore
> barriers) would work too.
>
If I really trusted anything beyond the very basics of the filesystem
to work well on Linux, this whole issue would be moot for most of the
production deployments I do. Ideally, fsync would push out just the
minimum of what's needed and call the appropriate write cache flush
mechanism the way the barrier implementation does when that all
works, and life would be good. Alternately, you might even switch to
using O_SYNC writes instead, which on a good filesystem
implementation are both accelerated and safe compared to write/fsync
(I've seen that work as expected on Veritas VxFS, for example).
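
The two styles being compared there look like this at the POSIX level
(a sketch only; whether O_SYNC is actually faster, or even safe,
depends entirely on the filesystem):

    #include <fcntl.h>
    #include <unistd.h>

    /* O_SYNC style: every write() returns only once the data is
     * supposed to be stable on disk. */
    int open_wal_osync(const char *path)
    {
        return open(path, O_WRONLY | O_CREAT | O_SYNC, 0600);
    }

    /* write/fsync style: write normally, then force everything out
     * with an explicit fsync(). */
    int write_then_fsync(int fd, const void *buf, size_t len)
    {
        if (write(fd, buf, len) != (ssize_t) len)
            return -1;
        return fsync(fd);
    }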

Meanwhile, in the actual world we live in: patches that make writes
more durable by default get dropped by the Linux community because
they tank performance for too many types of loads, I'm frightened to
turn on O_SYNC at all on ext3 because of the corruption reports on
the lists here, fsync does way more work than it needs to, and the
way the filesystem and block drivers have been separated makes it
difficult to do any sort of device write cache control from userland.
This is why I try to use the simplest, best-tested approach out there
whenever possible.

--
Greg Smith 2ndQuadrant Baltimore, MD
PostgreSQL Training, Services and Support
greg(at)2ndQuadrant(dot)com www.2ndQuadrant.com
