Re: New server: SSD/RAID recommendations?

From: "Graeme B(dot) Bell" <graeme(dot)bell(at)nibio(dot)no>
To: "Mkrtchyan, Tigran" <tigran(dot)mkrtchyan(at)desy(dot)de>
Cc: "Graeme B(dot) Bell" <graeme(dot)bell(at)nibio(dot)no>, Steve Crawford <scrawford(at)pinpointresearch(dot)com>, "Wes Vaske (wvaske)" <wvaske(at)micron(dot)com>, pgsql-performance <pgsql-performance(at)postgresql(dot)org>
Subject: Re: New server: SSD/RAID recommendations?
Date: 2015-07-07 12:04:21
Message-ID: 3DD748C4-3F21-4B19-9EF8-CFA8DCF83523@skogoglandskap.no
Lists: pgsql-performance


1. Does the Samsung NVMe have *complete* power loss protection though, for all fsync'd data?
I have been very badly burned by my experiences with Crucial SSDs and their 'power loss protection', which doesn't actually ensure all fsync'd data gets into flash.
It certainly looks pretty with all those capacitors on top in the photos, but we need some plug pull tests to be sure.
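
A plug-pull test along those lines can be sketched roughly as follows. This is only an illustration, not a tested tool: the record layout and the names `write_acked` and `highest_intact` are made up for this example. The idea is to write sequence numbers, fsync after each one, and report a number as acknowledged only once fsync has returned. After the power is cut, every acknowledged number must still be on the disk.

```python
import os
import struct

# Hypothetical record layout: one 8-byte little-endian sequence number.
RECORD = struct.Struct("<Q")

def write_acked(path, count, on_ack=print):
    """Write `count` sequence numbers to a fresh file, fsync'ing after each.

    on_ack(seq) is called only after fsync has returned, i.e. only after
    the storage stack has claimed that the record is durable.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for seq in range(count):
            os.write(fd, RECORD.pack(seq))
            os.fsync(fd)   # ask the OS and drive to make the write durable
            on_ack(seq)    # safe to report only at this point
    finally:
        os.close(fd)

def highest_intact(path):
    """Return the last number of the unbroken run 0, 1, 2, ... or -1 if none."""
    last = -1
    with open(path, "rb") as f:
        while True:
            chunk = f.read(RECORD.size)
            if len(chunk) < RECORD.size:
                break      # end of file or a torn final record
            (seq,) = RECORD.unpack(chunk)
            if seq != last + 1:
                break      # gap or garbage: the run ends here
            last = seq
    return last
```

The acknowledgements have to be recorded somewhere that survives the power cut, for example by running the writer over ssh so the numbers are printed on another machine. After reboot, `highest_intact` must be at least the last number that was printed; if it is lower, the drive acknowledged an fsync it did not honour. Note that the sketch does not fsync the containing directory, so a real test should create the file and sync it before starting the run.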

2. Apologies for the typo in the previous post, raidz5 should have been raidz1.

3. Also, something to think about when you start having single disk solutions (or non-ZFS raid, for that matter).

SSDs are so unlike HDDs.

The Samsung NVMe has a UBER (uncorrectable bit error rate) measured at 1 in 10^17. That's one bit gone bad in 12,500 TB, a good number. Chances are the drive fails before you hit a bit error, and if not, ZFS would catch it.

Whereas current HDDs are at the 1 in 10^14 level. That means an error every 12.5TB, by the specs. That means, every time you fill your cheap 6-8TB Seagate drive, there is a fair chance (roughly 40-50%, if bit errors are independent) that it corrupted some of your data *even if it performed according to the spec*. (That's also why RAID5 isn't viable for rebuilding large arrays, incidentally).
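
The arithmetic behind those numbers can be checked with a short sketch. It assumes decimal terabytes (10^12 bytes) and independent bit errors, which is a simplification of how real drives fail; the function names are just for illustration.

```python
import math

BITS_PER_TB = 8 * 10**12  # decimal terabytes, as drive vendors count them

def tb_per_expected_error(uber):
    """Terabytes read, on average, per uncorrectable bit error."""
    return 1 / (uber * BITS_PER_TB)

def p_at_least_one_error(tb_read, uber):
    """Probability of one or more bit errors when reading tb_read terabytes.

    Assumes independent errors: 1 - (1 - uber)**bits, evaluated with
    log1p/expm1 so that a tiny uber does not vanish in rounding.
    """
    bits = tb_read * BITS_PER_TB
    return -math.expm1(bits * math.log1p(-uber))

for label, uber in [("SSD, UBER 1e-17", 1e-17), ("HDD, UBER 1e-14", 1e-14)]:
    print(f"{label}: one error per {tb_per_expected_error(uber):,.1f} TB, "
          f"P(error in 8 TB) = {p_at_least_one_error(8, uber):.4f}")
```

Under these assumptions a 10^14 drive gives about a 38% chance of at least one bad bit when reading 6 TB and about 47% for 8 TB. For a hypothetical RAID5 rebuild that has to read 18 TB from the surviving disks, `p_at_least_one_error(18, 1e-14)` comes out at roughly 76%, which is the rebuild problem mentioned above.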

Graeme Bell

On 07 Jul 2015, at 12:56, Mkrtchyan, Tigran <tigran(dot)mkrtchyan(at)desy(dot)de> wrote:

>
>
> ----- Original Message -----
>> From: "Graeme B. Bell" <graeme(dot)bell(at)nibio(dot)no>
>> To: "Mkrtchyan, Tigran" <tigran(dot)mkrtchyan(at)desy(dot)de>
>> Cc: "Graeme B. Bell" <graeme(dot)bell(at)nibio(dot)no>, "Steve Crawford" <scrawford(at)pinpointresearch(dot)com>, "Wes Vaske (wvaske)"
>> <wvaske(at)micron(dot)com>, "pgsql-performance" <pgsql-performance(at)postgresql(dot)org>
>> Sent: Tuesday, July 7, 2015 12:38:10 PM
>> Subject: Re: [PERFORM] New server: SSD/RAID recommendations?
>
>> I am unsure about the performance side, but ZFS is generally very attractive to
>> me.
>>
>> Key advantages:
>>
>> 1) Checksumming and automatic fixing-of-broken-things on every file (not just
>> postgres pages, but your scripts, O/S, program files).
>> 2) Built-in lightweight compression (doesn't help with TOAST tables, in fact
>> may slow them down, but helpful for other things). This may actually be a net
>> negative for pg so maybe turn it off.
>> 3) ZRAID mirroring or ZRAID5/6. If you have trouble persuading someone that it's
>> safe to replace a RAID array with a single drive... you can use a couple of
>> NVMe SSDs with ZFS mirror or zraid, and get the same availability you'd get
>> from a RAID controller. Slightly better, arguably, since they claim to have
>> fixed the raid write-hole problem.
>> 4) filesystem snapshotting
>>
>> Despite the costs of checksumming etc., I suspect ZRAID running on a fast CPU
>> with multiple NVMe drives will outperform quite a lot of the alternatives, with
>> great data integrity guarantees.
>
>
> We are planning to have a test setup as well. For now I have a single NVMe SSD on my
> test system:
>
> # lspci | grep NVM
> 85:00.0 Non-Volatile memory controller: Samsung Electronics Co Ltd NVMe SSD Controller 171X (rev 03)
>
> # mount | grep nvm
> /dev/nvme0n1p1 on /var/lib/pgsql/9.5 type ext4 (rw,noatime,nodiratime,data=ordered)
>
>
> and I am quite happy with it. We have a write-heavy workload on it to see when it will
> break. Postgres performs very well: about 2.5x faster than with regular disks
> with a single client, and almost linear scaling with multiple clients (picture attached;
> Y is the number of high-level op/s our application does, X is the number of clients). The
> setup has been in use for the last 3 months. Looks promising, but for production we need
> to have a disk twice as big as on the test system. Until today, I was
> planning to use a RAID10 with a HW controller...
>
> Related to ZFS: we use ZFS on Linux, and its behaviour is not as good as with Solaris.
> Let's re-phrase it: performance is unpredictable. We run RAIDZ2 with 30x3TB disks.
>
> Tigran.
>
>>
>> Haven't built one yet. Hope to, later this year. Steve, I would love to know
>> more about how you're getting on with your NVMe disk in postgres!
>>
>> Graeme.
>>
>> On 07 Jul 2015, at 12:28, Mkrtchyan, Tigran <tigran(dot)mkrtchyan(at)desy(dot)de> wrote:
>>
>>> Thanks for the Info.
>>>
>>> So if RAID controllers are not an option, what should one use to build
>>> big databases? LVM with xfs? BtrFs? Zfs?
>>>
>>> Tigran.
>>>
>>> ----- Original Message -----
>>>> From: "Graeme B. Bell" <graeme(dot)bell(at)nibio(dot)no>
>>>> To: "Steve Crawford" <scrawford(at)pinpointresearch(dot)com>
>>>> Cc: "Wes Vaske (wvaske)" <wvaske(at)micron(dot)com>, "pgsql-performance"
>>>> <pgsql-performance(at)postgresql(dot)org>
>>>> Sent: Tuesday, July 7, 2015 12:22:00 PM
>>>> Subject: Re: [PERFORM] New server: SSD/RAID recommendations?
>>>
>>>> Completely agree with Steve.
>>>>
>>>> 1. Intel NVMe looks like the best bet if you have modern enough hardware for
>>>> NVMe. Otherwise e.g. S3700 mentioned elsewhere.
>>>>
>>>> 2. RAID controllers.
>>>>
>>>> We have e.g. 10-12 of these here and e.g. 25-30 SSDs, among various machines.
>>>> This might give people an idea about where the risk lies in the path from disk to
>>>> CPU.
>>>>
>>>> We've had 2 RAID card failures in the last 12 months that nuked the array with
>>>> days of downtime, and 2 problems with batteries suddenly becoming useless or
>>>> suddenly reporting wildly varying temperatures/overheating. There may have been
>>>> other RAID problems I don't know about.
>>>>
>>>> Our IT dept were replacing Seagate HDDs last year at a rate of 2-3 per week (I
>>>> guess they have 100-200 disks?). We also have about 25-30 Hitachi/HGST HDDs.
>>>>
>>>> So by my estimates:
>>>> 30% annual problem rate with RAID controllers
>>>> 30-50% failure rate with Seagate HDDs (backblaze saw similar results)
>>>> 0% failure rate with HGST HDDs.
>>>> 0% failure in our SSDs. (to be fair, our one samsung SSD apparently has a bug
>>>> in TRIM under linux, which I'll need to investigate to see if we have been
>>>> affected by).
>>>>
>>>> also, RAID controllers aren't free - not just the money but also the management
>>>> of them (ever tried writing a complex install script that interacts with
>>>> MegaCLI? It can be done but it's not much fun.). Just take a look at the
>>>> MegaCLI manual and ask yourself... is this even worth it (if you have a good
>>>> MTBF on an enterprise SSD).
>>>>
>>>> RAID was meant to be about ensuring availability of data. I have trouble
>>>> believing that these days....
>>>>
>>>> Graeme Bell
>>>>
>>>>
>>>> On 06 Jul 2015, at 18:56, Steve Crawford <scrawford(at)pinpointresearch(dot)com> wrote:
>>>>
>>>>>
>>>>> 2. We don't typically have redundant electronic components in our servers. Sure,
>>>>> we have dual power supplies and dual NICs (though generally to handle external
>>>>> failures) and ECC-RAM but no hot-backup CPU or redundant RAM banks and...no
>>>>> backup RAID card. Intel Enterprise SSD already have power-fail protection so I
>>>>> don't need a RAID card to give me BBU. Given the MTBF of good enterprise SSD
>>>>> I'm left to wonder if placing a RAID card in front merely adds a new point of
>>>>> failure and scheduled-downtime-inducing hands-on maintenance (I'm looking at
>>>>> you, RAID backup battery).
>>>>
>>>>
>>>>
>>>> --
>>>> Sent via pgsql-performance mailing list (pgsql-performance(at)postgresql(dot)org)
>>>> To make changes to your subscription:
>>>> http://www.postgresql.org/mailpref/pgsql-performance
>>
>>
>>
> <pg-with-ssd.png>
