Re: OT - 2 of 4 drives in a Raid10 array failed - Any chance of recovery?

From: Greg Smith <gsmith(at)gregsmith(dot)com>
To: Craig Ringer <craig(at)postnewspapers(dot)com(dot)au>
Cc: PG-General Mailing List <pgsql-general(at)postgresql(dot)org>, ow(dot)mun(dot)heng(at)wdc(dot)com
Subject: Re: OT - 2 of 4 drives in a Raid10 array failed - Any chance of recovery?
Date: 2009-10-21 06:30:35
Message-ID: alpine.GSO.2.01.0910210210180.1418@westnet.com
Lists: pgsql-general

On Tue, 20 Oct 2009, Craig Ringer wrote:

> You made an exact image of each drive onto new, spare drives with `dd'
> or a similar disk imaging tool before trying ANYTHING, right? Otherwise,
> you may well have made things worse, particularly since you've tried to
> resync the array. Even if the data was recoverable before, it might not
> be now.

This is actually pretty hard to screw up with Linux software RAID. It's
not easy to corrupt a working volume by trying to add a bogus drive or by
typing a simple command wrong. You'd have to botch the drive addition
process altogether and screw with something else to take out a good drive.

> If the problem is just a few bad sectors, you can usually just
> force-re-add the drives into the array and then copy the array contents
> to another drive either at a low level (with dd_rescue) or at a file
> system level.

This approach has saved me more than once. On the flip side, I have also
more than once accidentally wiped out my only good copy of the data by
making a mistake during an attempt at stressed-out heroics like this.
You certainly don't want to wander down this more complicated path if
there's a simple fix available within the context of the standard tools
for array repairs.
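
For readers unfamiliar with what a low-level rescue copy does, here is a
minimal sketch of the general idea: read the source block by block, and
when a block cannot be read, write zeros in its place and keep going
rather than aborting. This is not how dd_rescue itself is implemented; the
names rescue_copy and read_block are hypothetical, and the bad sector is
simulated so the example can run without touching any real device.

```python
import io


def rescue_copy(read_block, total_size, out, block_size=4096):
    """Copy total_size bytes to out, zero-filling anything unreadable.

    read_block(offset, length) is a hypothetical callable that returns
    bytes, or raises OSError when that region cannot be read.
    Returns a list of (offset, length) ranges that were zero-filled.
    """
    bad_ranges = []
    offset = 0
    while offset < total_size:
        length = min(block_size, total_size - offset)
        try:
            data = read_block(offset, length)
        except OSError:
            # Unreadable block: note it and substitute zeros.
            data = b""
        if len(data) < length:
            # Record the missing part (whole block or short-read tail).
            bad_ranges.append((offset + len(data), length - len(data)))
            data = data.ljust(length, b"\x00")
        out.seek(offset)
        out.write(data)
        offset += length
    return bad_ranges


if __name__ == "__main__":
    # Simulated source with one "bad sector" at offset 4096.
    source = bytes(range(256)) * 64

    def flaky_read(offset, length):
        if offset == 4096:
            raise OSError("simulated unreadable block")
        return source[offset:offset + length]

    image = io.BytesIO()
    print(rescue_copy(flaky_read, len(source), image))
```

The point of the design is that one unreadable block costs you only that
block, and the list of bad ranges tells you afterwards which parts of the
copy cannot be trusted.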

> On a side note: I'm personally increasingly annoyed with the tendency of
> RAID controllers (and s/w raid implementations) to treat disks with
> unrepairable bad sectors as dead and fail them out of the array.

Given how fast drives tend to go completely dead once the first error
shows up, this is a reasonable policy in general.

> Rather than failing a drive and as a result rendering the whole array
> unreadable in such situations, it should mark the drive defective, set
> the array to read-only, and start screaming for help.

The idea is great, but you have to ask exactly how the hardware and
software involved are supposed to enforce making the array read-only. I
don't think the ATA and similar command sets implement that concept at the
level where hardware RAID would need it to make this idea work. Linux
software RAID could keep you from mounting the array read/write in this
situation, but the way errors percolate up from the disk devices to the
array ones in Linux has too many layers in it (especially if LVM is stuck
in the middle there too) for that to be simple to implement either.

--
* Greg Smith gsmith(at)gregsmith(dot)com http://www.gregsmith.com Baltimore, MD
