What To Do When RAID Has a Hard Drive Failure
-
@Dashrender said:
@scottalanmiller said:
How Long Will a RAID Rebuild Take?
For moderately sized arrays it is not uncommon for this process to take days or even weeks, and it is not unheard of for the process on RAID 6 to take over a month.
Wow, I realize that drives in single arrays don't fail all that often, but damn, a month? What is the likelihood of a second or even third drive failure in a situation where the resilver takes a month? That degraded, at-risk state really seems to make RAID 6 a bad choice, unless having access to the data just isn't that critical and rebuilding the array from backups is considered almost the norm.
A month is not unreasonable. I've seen a 7TB RAID 5 setup take a week to rebuild.
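A week is easy to sanity-check with a back-of-envelope estimate (a sketch with assumed numbers, not measurements): a parity rebuild has to read essentially the whole surviving array to reconstruct the replacement drive, so array capacity divided by sustained rebuild throughput gives a rough floor.

# Rough rebuild-time floor: capacity / sustained throughput.
# Both numbers below are assumptions for illustration only.
array_tb = 7                  # the 7TB RAID 5 array mentioned above
throughput_mb_s = 15          # assumed effective rebuild rate under load

seconds = array_tb * 1e12 / (throughput_mb_s * 1e6)
print(f"~{seconds / 86400:.1f} days")   # ~5.4 days, consistent with "a week"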
-
@Dashrender said:
@scottalanmiller said:
How Long Will a RAID Rebuild Take?
For moderately sized arrays it is not uncommon for this process to take days or even weeks, and it is not unheard of for the process on RAID 6 to take over a month.
Wow, I realize that drives in single arrays don't fail all that often, but damn, a month? What is the likelihood of a second or even third drive failure in a situation where the resilver takes a month? That degraded, at-risk state really seems to make RAID 6 a bad choice, unless having access to the data just isn't that critical and rebuilding the array from backups is considered almost the norm.
Secondary drive failure starts to become incredibly common under a month of parity resilver stress. This is just one of many reasons why parity RAID is rarely a good fit for production workloads.
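To put a hedged number on that risk, here is a minimal sketch assuming independent failures at a constant annualized failure rate (AFR). The drive count and AFR are assumptions for illustration, and the result understates reality, since resilver stress and same-batch drives push the real rate up.

# Chance that at least one more drive fails during a month-long rebuild,
# assuming independent failures at a constant AFR. Resilver stress and
# correlated batch failures make the real-world odds worse than this.
afr = 0.05                  # assumed 5% annualized failure rate per drive
survivors = 11              # assumed 12-drive array, one drive already lost
window_years = 30 / 365     # one-month resilver window

p_drive_ok = (1 - afr) ** window_years
p_second_failure = 1 - p_drive_ok ** survivors
print(f"{p_second_failure:.1%}")   # ~4.5%, before stress effects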
-
@dafyre said:
A month is not unreasonable. I've seen a 7TB RAID 5 setup take a week to rebuild.
And that's a tiny array by modern standards! Just move that to RAID 6 and you likely add 20% or more time for that rebuild because there is so much more math involved. And lots of modern arrays are 14TB or bigger. That same array on 14TB would be a fortnight. Put that on RAID 6, which would be absolutely necessary for an array so large, and you are up to three weeks almost certainly.
Now when a lot of these shops are talking 30TB+ and always RAID 6 or RAID 7, you start to see where a month is easy to hit. I've seen some that we estimated at far longer.
And if people are using the array during the day, you can easily lose 30% of your rebuild throughput to the array being under production load.
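Stringing those multipliers together shows how the math compounds (a sketch using the figures from this post; reading "lose 30%" as only 70% of the time going to the rebuild is an assumption):

# Chaining the rough multipliers above: scale to 14TB, add ~20% for
# RAID 6 parity math, then assume only 70% of each day goes to rebuilding.
base_days = 7.0                         # observed 7TB RAID 5 rebuild
days_14tb = base_days * (14 / 7)        # ~14 days: "a fortnight"
days_raid6 = days_14tb * 1.20           # ~17 days: approaching three weeks
days_under_load = days_raid6 / 0.70     # ~24 days: a month is easy to hit
print(round(days_14tb), round(days_raid6), round(days_under_load))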
-
Again, at that point, is it worth the risk of resilvering? It seems almost better to ensure you have a good backup, then wipe and restore; hell, it will probably take less time.
-
@Dashrender said:
Again, at that point, is it worth the risk of resilvering? It seems almost better to ensure you have a good backup, then wipe and restore; hell, it will probably take less time.
If you believe that to be the case, you should probably not have had a RAID array in the first place. It would be pretty rare to build a RAID array with the intent of not recovering from a drive failure.
Of course, that approach does, we presume, afford an opportunity to take a last-minute backup before rebuilding with a fresh array, but it also requires one.
But it is almost better to take rapid backups and use RAID 0 if we never intend to do the drive replacement on parity RAID.
-
One thing that always confuses me (and perhaps this is just with DELL servers) is that if the drive hasn't actually failed yet (but is in predictive failure status), you have to (or are supposed to) log in and take the drive offline first.
I know you were talking about failed drives, but I think this is also worth mentioning under the same context.
-
@BRRABill said:
One thing that always confuses me (and perhaps this is just with DELL servers) is that if the drive hasn't actually failed yet (but is in predictive failure status), you have to (or are supposed to) log in and take the drive offline first.
Supposed to, because you are preventing the risks that come with it actually failing. When a drive actually fails, the controller offlines it. When it is predictive, it does not. So the drive is alive and spinning if you pull it. It should work, but why add risk? Offline it and be sure that it is spun down and not being used when it gets yanked.
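The exact steps are controller-specific and Dell's tools vary by generation, so as a neutral illustration, here is what the same "offline it before you pull it" sequence looks like on Linux software RAID with mdadm (array and device names are placeholders):

mdadm /dev/md0 --fail /dev/sdb1      # mark the suspect member as failed
mdadm /dev/md0 --remove /dev/sdb1    # detach it so it is safe to pull
mdadm /dev/md0 --add /dev/sdc1       # add the replacement; rebuild starts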
-
@scottalanmiller said:
Supposed to, because you are preventing the risks that come with it actually failing. When a drive actually fails, the controller offlines it. When it is predictive, it does not. So the drive is alive and spinning if you pull it. It should work, but why add risk? Offline it and be sure that it is spun down and not being used when it gets yanked.
Right. You know me ... always thinking of the NOOB questions. (WWBA? What Would BRRABill Ask?)
-
I've had a 12TB RAID 6 array (18TB raw) with 10TB of data on it rebuild in about 24 hours. Software RAID shines here.
-
@marcinozga said:
I've had a 12TB RAID 6 array (18TB raw) with 10TB of data on it rebuild in about 24 hours. Software RAID shines here.
What drives and software RAID implementation are you using?
And yes, software RAID can divert system resources to the computations, making it much faster. People really don't realize just how much faster software RAID is than hardware RAID.
-
@scottalanmiller said:
@marcinozga said:
I've had a 12TB RAID 6 array (18TB raw) with 10TB of data on it rebuild in about 24 hours. Software RAID shines here.
What drives and software RAID implementation are you using?
And yes, software RAID can divert system resources to the computations, making it much faster. People really don't realize just how much faster software RAID is than hardware RAID.
ZFS on FreeBSD with WD Red 3TB drives, and a quad-core Xeon w/HT.
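For anyone curious what that looks like in practice, a ZFS resilver is kicked off with zpool replace and watched with zpool status (the pool and device names below are placeholders):

zpool replace tank da2 da6    # swap the failed disk for the new one
zpool status tank             # shows resilver progress and estimated time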
-
I usually just power off the server, yank all the drives out, mix em all up and put em back in. I also enjoy testing my backups quite often.
I'll have to try your method next time!