RAID rebuild times 16TB drive
-
@biggen said in RAID rebuild times 16TB drive:
So, for example, a RAID 1 16TB mirror would have the same rebuild time as a RAID 6 32TB array (4 x 16TB) or a RAID 10 32TB array (4 x 16TB)? I must be misunderstanding.
In the situation where the only bottleneck in question is the write speed of the receiving drive, you'd absolutely expect the rebuild time to be the same, because the source is taken out of the equation (by the phrase "no other bottlenecks") and the only factor in play is the write speed of that same drive. So of course it will be identical.
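To put rough numbers on that (purely illustrative, assuming a hypothetical ~250 MB/s sustained write speed for a 16TB drive, which is not a spec for any particular model):

```python
# Back-of-the-envelope rebuild time when the target drive's sequential
# write speed is the only bottleneck. The 250 MB/s figure is an assumed,
# typical-ish sustained rate for a large 7200 RPM drive, not a datasheet value.

DRIVE_SIZE_TB = 16
WRITE_MBPS = 250  # assumed sustained sequential write speed, MB/s

drive_size_mb = DRIVE_SIZE_TB * 1_000_000  # 1 TB = 1,000,000 MB (decimal)
rebuild_seconds = drive_size_mb / WRITE_MBPS
print(f"Idle rebuild estimate: {rebuild_seconds / 3600:.1f} hours")
# -> roughly 17.8 hours, and it's the same whether that 16TB drive sits in a
#    RAID 1 mirror or a wider RAID 6/10 set, since only one drive is being
#    written to either way.
```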
-
@scottalanmiller @Pete-S
Excellent. Thanks for that explanation guys and that nifty diagram Pete!
I guess I was skeptical that I had understood what @Pete-S said correctly because I've seen so many reports that it's taken days/weeks to rebuild [insert whatever size] TB RAID 6 arrays in the past. But I guess that was because those systems weren't just idle. There was still I/O on those arrays AND a possible CPU/cache bottleneck.
-
@biggen said in RAID rebuild times 16TB drive:
@scottalanmiller @Pete-S
Excellent. Thanks for that explanation guys and that nifty diagram Pete!
I guess I was skeptical that I had understood what @Pete-S said correctly because I've seen so many reports that it's taken days/weeks to rebuild [insert whatever size] TB RAID 6 arrays in the past. But I guess that was because those systems weren't just idle. There was still I/O on those arrays AND a possible CPU/cache bottleneck.
We don't see any bottlenecks on our software RAID-6 arrays but they run bare metal on standard servers. That might be atypical, I don't know.
But I think regular I/O has a much bigger effect than any other bottleneck. I can see how MB/s takes a nosedive when rebuilding while there is activity on the drive array.
If you think about it, when the drive is only doing the rebuild it's just doing sequential reads/writes, and hard drives are up to 50% as fast as SATA SSDs at that. But when other I/O comes in, it becomes a question of IOPS, and hard drives are really bad at that, with only about 1% of the IOPS of an SSD.
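A very rough model of that effect (all numbers are assumptions for illustration, not measurements):

```python
# Rough model of why foreground I/O murders rebuild speed on spinning disks.
# Assumed figures: ~200 MB/s sequential and ~150 random IOPS for an HDD,
# versus tens of thousands of IOPS for a SATA SSD.

HDD_SEQ_MBPS = 200      # assumed sequential throughput
HDD_RANDOM_IOPS = 150   # assumed random IOPS for a 7200 RPM drive

def rebuild_rate(foreground_iops: float) -> float:
    """Approximate MB/s left over for the rebuild after servicing foreground I/O."""
    # Fraction of each second the disk spends seeking for foreground requests.
    busy_fraction = min(foreground_iops / HDD_RANDOM_IOPS, 1.0)
    return HDD_SEQ_MBPS * (1.0 - busy_fraction)

for iops in (0, 25, 50, 100, 150):
    print(f"{iops:>3} foreground IOPS -> ~{rebuild_rate(iops):6.1f} MB/s for the rebuild")
# Even ~100 random IOPS of foreground work eats most of the drive's time,
# which is why "days or weeks" to rebuild usually means a busy array.
```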
-
@biggen said in RAID rebuild times 16TB drive:
I guess I was skeptical that I had understood what @Pete-S said correctly because I've seen so many reports that it's taken days/weeks to rebuild [insert whatever size] TB RAID 6 arrays in the past. But I guess that was because those systems weren't just idle. There was still I/O on those arrays AND a possible CPU/cache bottleneck.
Well you are still skipping the one key phrase "no other bottlenecks." According to most reports, there are normally extreme bottlenecks (either because of computational time and/or systems not being completely idle) so the information you are getting is in no way counter to what you've already heard as reports.
You are responding as if you feel that this is somehow different, but it is not.
It's a bit like hearing that a Chevy Sonic can go 200 mph when dropped out of an airplane, and then saying that most people say they never get it over 90 mph, ignoring the obvious key fact that it's being dropped out of an airplane that lets it go so fast.
-
@Pete-S said in RAID rebuild times 16TB drive:
We don't see any bottlenecks on our software RAID-6 arrays but they run bare metal on standard servers. That might be atypical, I don't know.
Even on bare metal, we normally see a lot of bottlenecks. But normally because almost no one can make their arrays go idle during a rebuild cycle. If they could, they'd not need the rebuild in the first place, typically.
-
@scottalanmiller said in RAID rebuild times 16TB drive:
It's a system bottleneck, not an I/O bottleneck, typically. Especially with RAID 6. It's math that runs on a single thread.
Distributed storage systems with per-object RAID FTW here. If I have every VMDK running its own rebuild process (vSAN) or every individual LUN/CPG (how Compellent or 3PAR do it), then a given drive failing is a giant party across all of the drives in the cluster/system. (Also how the fancy erasure-code array systems run this.)
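For a feel of why spreading the rebuild helps so much, here's an illustrative comparison with made-up numbers (the per-drive speed and cluster size are assumptions, not figures from vSAN, 3PAR, or Compellent):

```python
# Illustrative comparison: rebuilding a failed drive onto one hot spare vs.
# spreading the rebuild across many surviving drives (per-object / distributed style).
# All figures are assumptions, not vendor measurements.

FAILED_DATA_TB = 16
PER_DRIVE_WRITE_MBPS = 250   # assumed per-drive sequential write speed
CLUSTER_DRIVES = 24          # assumed number of surviving drives sharing the rebuild

def hours(tb: float, mbps: float) -> float:
    return tb * 1_000_000 / mbps / 3600

single_target = hours(FAILED_DATA_TB, PER_DRIVE_WRITE_MBPS)
spread_out = hours(FAILED_DATA_TB, PER_DRIVE_WRITE_MBPS * CLUSTER_DRIVES)

print(f"Rebuild onto one spare:         ~{single_target:.1f} h")
print(f"Rebuild spread over {CLUSTER_DRIVES} drives:  ~{spread_out:.2f} h")
# With per-object placement every drive absorbs a slice of the rewritten
# data, so the "one slow target drive" bottleneck largely disappears.
```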
-
@scottalanmiller said in RAID rebuild times 16TB drive:
Even on bare metal, we normally see a lot of bottlenecks. But normally because almost no one can make their arrays go idle during a rebuild cycle. If they could, they'd not need the rebuild in the first place, typically.
Our engineers put in a default "reserve 80% of max throughput for production" I/O scheduler QoS system, so at saturation rebuilds only get 20% and don't murder production I/O. (Note: rebuilds can use 100% if the bandwidth is there for the taking.)
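A minimal sketch of that kind of policy, assuming a hypothetical 1,000 MB/s device and an 80/20 split (this illustrates the idea only, not any vendor's actual scheduler code):

```python
# Sketch of a QoS split: rebuild traffic may use idle bandwidth freely, but
# once the device saturates, production I/O is guaranteed 80% and the rebuild
# is squeezed to the remaining 20%. Purely illustrative.

DEVICE_MBPS = 1000           # assumed total device throughput
PRODUCTION_RESERVE = 0.80    # share guaranteed to production at saturation

def split(production_demand: float, rebuild_demand: float) -> tuple[float, float]:
    """Return (production, rebuild) MB/s granted for this interval."""
    if production_demand + rebuild_demand <= DEVICE_MBPS:
        return production_demand, rebuild_demand   # no contention: both run freely
    prod = min(production_demand, DEVICE_MBPS * PRODUCTION_RESERVE)
    return prod, DEVICE_MBPS - prod                # rebuild gets the leftovers

print(split(100, 900))   # mostly idle array: rebuild gets its full 900 MB/s
print(split(900, 900))   # saturated: production gets 800, rebuild gets 200
```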
-
@biggen said in RAID rebuild times 16TB drive:
I guess I was skeptical that I had understood what @Pete-S said correctly because I've seen so many reports that it's taken days/weeks to rebuild [insert whatever size] TB RAID 6 arrays in the past. But I guess that was because those systems weren't just idle. There was still I/O on those arrays AND a possible CPU/cache bottleneck.
Was the drive full? Smarter new RAID rebuild systems don't rebuild empty LBAs. Every enterprise storage array system has done this with rebuilds for the last 20 years...
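The difference is easy to see with assumed numbers (a made-up 30% fill level and ~250 MB/s write speed):

```python
# Why "was the drive full?" matters: a rebuild that copies only allocated
# LBAs scales with used capacity, not raw capacity. Numbers are assumed
# for illustration only.

DRIVE_TB = 16
USED_FRACTION = 0.30         # assumed: the array is 30% full
WRITE_MBPS = 250             # assumed sustained write speed

def rebuild_hours(tb: float) -> float:
    return tb * 1_000_000 / WRITE_MBPS / 3600

print(f"Naive full-LBA rebuild: ~{rebuild_hours(DRIVE_TB):.1f} h")
print(f"Allocated-only rebuild: ~{rebuild_hours(DRIVE_TB * USED_FRACTION):.1f} h")
# A mostly empty 16TB member rebuilds in a fraction of the worst-case time
# when the controller knows which LBAs actually hold data.
```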
-
@StorageNinja No personal experience with it. I've only ever run RAID 1 or 10. Just the reading I've done over the years from people reporting how long it took to rebuild larger RAID 6 arrays.
BTW, are you the same person who is/was over at Spiceworks? I always enjoyed reading your posts on storage. I respect both you and @scottalanmiller in this arena immensely.
-
@biggen said in RAID rebuild times 16TB drive:
BTW, are you the same person who is/was over at Spiceworks?
Yes he is.
-
@biggen said in RAID rebuild times 16TB drive:
BTW, are you the same person who is/was over at Spiceworks? I always enjoyed reading your posts on storage. I respect both you and @scottalanmiller in this arena immensely.
Yup, he's one of the "Day Zero" founders over here.
-
@StorageNinja said in RAID rebuild times 16TB drive:
@scottalanmiller said in RAID rebuild times 16TB drive:
It's a system bottleneck, not an I/O bottleneck, typically. Especially with RAID 6. It's math that runs on a single thread.
Distributed storage systems with per-object RAID FTW here. If I have every VMDK running its own rebuild process (vSAN) or every individual LUN/CPG (how Compellent or 3PAR do it), then a given drive failing is a giant party across all of the drives in the cluster/system. (Also how the fancy erasure-code array systems run this.)
Yeah, that's RAIN, and that basically solves everything.