Re: rough benchmarks, sata vs. ssd

From: CSS <css(at)morefoo(dot)com>
To: Ivan Voras <ivoras(at)freebsd(dot)org>
Cc: pgsql-performance(at)postgresql(dot)org
Subject: Re: rough benchmarks, sata vs. ssd
Date: 2012-02-13 21:49:38
Message-ID: 2813ABC5-803E-408A-8FD4-3D3C22014BFD@morefoo.com
Lists: pgsql-performance

For the top-post scanners, I updated the ssd test to include
changing the zfs recordsize to 8k.

On Feb 11, 2012, at 1:35 AM, CSS wrote:

>
> On Feb 3, 2012, at 6:23 AM, Ivan Voras wrote:
>
>> On 31/01/2012 09:07, CSS wrote:
>>> Hello all,
>>>
>>> Just wanted to share some results from some very basic benchmarking
>>> runs comparing three disk configurations on the same hardware:
>>>
>>> http://morefoo.com/bench.html
>>
>> That's great!
>
> Thanks. I did spend a fair amount of time on it. It was also a
> good excuse to learn a little about gnuplot, which I used to draw
> the (somewhat oddly combined) system stats. I really wanted to see
> IO and CPU info over the duration of a test even if I couldn't
> really know what part of the test was running. Don't ask me why
> iostat sometimes shows greater than 100% in the "busy" column
> though. It is in the raw iostat output I used to create the graphs.
>
>>
>>> *Tyan B7016 mainboard w/onboard LSI SAS controller
>>> *2x4 core xeon E5506 (2.13GHz)
>>> *64GB ECC RAM (8GBx8 ECC, 1033MHz)
>>> *2x250GB Seagate SATA 7200.9 (ST3250824AS) drives (yes, old and slow)
>>> *2x160GB Intel 320 SSD drives
>>
>> It shows that you can have large cheap SATA drives and small fast SSD-s, and up to a point have best of both worlds. Could you send me (privately) a tgz of the results (i.e. the pages+images from the above URL)? I'd like to host them somewhere more permanently.
>
> Sent offlist, including raw vmstat, iostat and zpool iostat output.
>
>>
>>> The ZIL is a bit of a cheat, as it allows you to throw all the
>>> synchronous writes to the SSD
>>
>> This is one of the main reasons it was made. It's not a cheat, it's by design.
>
> I meant that only in the best way. Some of my proudest achievements
> are cheats. :)
>
> It's a clever way of moving cache to something non-volatile and
> providing a fallback, although the fallback would be insanely slow
> in comparison.
>
>>
>>> Why ZFS? Well, we adopted it pretty early for other tasks and it
>>> makes a number of tasks easy. It's been stable for us for the most
>>> part and our latest wave of boxes all use cheap SATA disks, which
>>> gives us two things - a ton of cheap space (in 1U) for snapshots and
>>> all the other space-consuming toys ZFS gives us, and on this cheaper
>>> disk type, a guarantee that we're not dealing with silent data
>>> corruption (these are probably the normal fanboy talking points).
>>> ZFS snapshots are also a big time-saver when benchmarking. For our
>>> own application testing I load the data once, shut down postgres,
>>> snapshot pgsql + the app homedir and start postgres. After each run
>>> that changes on-disk data, I simply rollback the snapshot.
>>
>> Did you tune ZFS block size for the postgresql data directory (you'll need to re-create the file system to do this)? When I investigated it in the past, it really did help performance.
>

Well, now I did. I added the results to
http://ns.morefoo.com/bench.html and it looks like there's
certainly an improvement. The only change from the previous test
was to copy the postgres data dir, wipe the original, set the zfs
recordsize to 8K (default is 128K), and then copy the data dir
back.

Things that stand out at first glance:

-at a scaling factor of 10 or greater, there is a much more gentle
decline in TPS than with the default zfs recordsize
-on the raw *disk* IOPS graph, I now see writes peaking at around
11K/second compared to 1.5K/second.
-on the zpool iostat graph, I do not see those huge write peaks,
which is a bit confusing
-on both iostat graphs, I see the datapoints look more scattered
with the 8K recordsize

Any comments are certainly welcome. I understand 8K recordsize
should perform better since that's the size of the chunks of data
postgresql is dealing with, but the effects on the system graphs
are interesting and I'm not quite following how it all relates.

I wonder if the recordsize impacts the ssd write amplification at
all...
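
As a rough way to think about the recordsize question, here is a small sketch of the worst-case write ratio at the file system level. It assumes every 8K PostgreSQL page write lands in a different ZFS record and that ZFS rewrites each touched record in full; it ignores caching, batching of writes and compression, and it says nothing about write amplification inside the SSD's own flash translation layer. The function name is made up for illustration.

```python
# Worst-case estimate only: assumes every 8K PostgreSQL page write dirties a
# different ZFS record and that ZFS rewrites each dirty record in full.
KIB = 1024
PG_PAGE_BYTES = 8 * KIB  # PostgreSQL default block size


def worst_case_write_ratio(recordsize_bytes, page_bytes=PG_PAGE_BYTES):
    """Bytes rewritten per byte of page data, in the worst case."""
    if recordsize_bytes <= 0 or page_bytes <= 0:
        raise ValueError("sizes must be positive")
    # A record smaller than the page cannot rewrite less than the page itself.
    return max(recordsize_bytes, page_bytes) / page_bytes


for recordsize_kib in (128, 8):
    ratio = worst_case_write_ratio(recordsize_kib * KIB)
    print(f"recordsize={recordsize_kib}K: up to {ratio:.0f}x bytes written per 8K page")
```

Under those assumptions the 128K default could rewrite up to 16 times as much data per page as the 8K setting. How much of that shows up on real hardware depends on the workload, so treat it as an upper bound rather than a measurement.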

Thanks,

Charles

> I actually did not. A year or so ago I was doing some basic tests
> on cheap SATA drives with ZFS and at least with pgbench, I could see
> no difference at all. I actually still have some of that info, so
> I'll include it here. This was a 4-core xeon, E5506 2.1GHZ, 4 1TB
> WD RE3 drives in a RAIDZ1 array, 8GB RAM.
>
> I tested three things - time to load an 8.5GB dump of one of our
> dbs, time to run through a querylog of real data (1.4M queries), and
> then pgbench with a scaling factor of 100, 20 clients, 10K
> transactions per client.
>
> default 128K zfs recordsize:
>
> -9 minutes to load data
> -17 minutes to run query log
> -pgbench output
>
> transaction type: TPC-B (sort of)
> scaling factor: 100
> query mode: simple
> number of clients: 20
> number of transactions per client: 10000
> number of transactions actually processed: 200000/200000
> tps = 100.884540 (including connections establishing)
> tps = 100.887593 (excluding connections establishing)
>
> 8K zfs recordsize (wipe data dir and reinit db)
>
> -10 minutes to load data
> -21 minutes to run query log
> -pgbench output
>
> transaction type: TPC-B (sort of)
> scaling factor: 100
> query mode: simple
> number of clients: 20
> number of transactions per client: 10000
> number of transactions actually processed: 200000/200000
> tps = 97.896038 (including connections establishing)
> tps = 97.898279 (excluding connections establishing)
>
> Just thought I'd include that since I have the data.
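
To put the two older runs side by side, a short sketch that computes the relative change for each figure quoted above and the approximate pgbench wall-clock time. The numbers are copied from the quoted output; the load and query log times are whole minutes, so those percentages are coarse. The helper name is just for illustration.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# Figures copied from the two runs above: (128K recordsize, 8K recordsize).
results = {
    "load time (min)": (9, 10),
    "query log time (min)": (17, 21),
    "pgbench tps (excl. connections)": (100.887593, 97.898279),
}

for name, (rs_128k, rs_8k) in results.items():
    print(f"{name}: {rs_128k} -> {rs_8k} ({percent_change(rs_128k, rs_8k):+.1f}%)")

# Wall-clock time for 20 clients x 10000 transactions at the reported rate.
total_tx = 20 * 10_000
for label, tps in (("128K", 100.887593), ("8K", 97.898279)):
    print(f"{label}: about {total_tx / tps / 60:.1f} minutes of pgbench")
```

On the SATA RAIDZ1 box the pgbench difference is only a few percent, while the query log run is slower by a larger margin with the 8K recordsize.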
>
>>
>>> I don't have any real questions for the list, but I'd love to get
>>> some feedback, especially on the ZIL results. The ZIL results
>>> interest me because I have not settled on what sort of box we'll be
>>> using as a replication slave for this one - I was going to either go
>>> the somewhat risky route of another all-SSD box or looking at just
>>> how cheap I can go with lots of 2.5" SAS drives in a 2U.
>>
>> You probably know the answer to that: if you need lots of storage, you'll probably be better off using large SATA drives with small SSDs for the ZIL. 160 GB is probably more than you need for ZIL.
>>
>> One thing I never tried is mirroring a SATA drive and a SSD (only makes sense if you don't trust SSDs to be reliable yet) - I don't know if ZFS would recognize the asymmetry and direct most of the read requests to the SSD.
>
> Our databases are pretty tiny. We could squeeze them on a pair of 160GB mirrored SSDs.
>
> To be honest, the ZIL results really threw me for a loop. I had supposed that it would work well with bursty usage but that eventually the SATA drives would still be a choke point during heavy sustained sync writes since the difference in random sync write performance between the ZIL drives (SSD) and the actual data drives (SATA) was so huge. The benchmarks ran for quite some time and I am not spotting a point in the system graphs where the SATA gets truly saturated to the point that performance suffers.
>
> I now have to think about whether a safe replication slave/backup could be built in 1U with 4 2.5" SAS drives and a small mirrored pair of SSDs for ZIL. We've been trying to avoid building monster boxes - not only are 2.5" SAS drives expensive, but so is whatever case you find to hold a dozen or so of them. Outside of some old Sun blog posts, I am finding little evidence of people running PostgreSQL on ZFS with SATA drives augmented with SSD ZIL. I'd love to hear more feedback on that.
>
>>
>>> If you have any test requests that can be quickly run on the above
>>> hardware, let me know.
>>
>> Blogbench (benchmarks/blogbench) results are always nice to see in a comparison.
>
> I don't know much about it, but here's what I get on the zfs mirrored SSD pair:
>
> [root(at)bltest1 /usr/ports/benchmarks/blogbench]# blogbench -d /tmp/bbench
>
> Frequency = 10 secs
> Scratch dir = [/tmp/bbench]
> Spawning 3 writers...
> Spawning 1 rewriters...
> Spawning 5 commenters...
> Spawning 100 readers...
> Benchmarking for 30 iterations.
> The test will run during 5 minutes.
> […]
>
> Final score for writes: 182
> Final score for reads : 316840
>
> Thanks,
>
> Charles
>
>
>>
>> --
>> Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org)
>> To make changes to your subscription:
>> http://www.postgresql.org/mailpref/pgsql-performance
>
