Safe to have a 48TB Windows volume?
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
Some examples of things we have math to tell us are good or bad...
RAID 10 .... we've done massive empirical studies. We know that the RAID systems themselves are insanely reliable.
Cheap SAN like the P2000 .... we know that by collecting anecdotes, and knowing total sales figures, that the failure rates of those observed alone is too high for the entire existing set of products made, and we can safely assume that the number we have not observed is vastly higher. But observation alone tells us that the reliability is not high enough for any production use.We lost an entire virtualization platform and had to recover from scratch because the second member of a RAID 10 pair failed after replacing the first and a rebuild initiating. We'll stick with RAID 6 thanks.
EDIT: The on-site IT and I were well into our coffee chat when the spontaneous beep/beep happened and we were both, WTF?
See, that's an irrational, emotional reaction that we are trying to avoid. You have one anecdote that tells you nothing, but you make a decision based on it that goes against math and empirical studies. Why?
And even the anecdote doesn't tell you that RAID 6 would have protected you. Only that RAID 10 wasn't able to.
Had you used RAID 6, it might have failed too, possibly worse, and we'd be having the opposite conversation about how you can never trust RAID 6.
Bottom line, using individual anecdotes for answers is the one thing we know is bad to do.
-
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
Some examples of things we have math to tell us are good or bad...
RAID 10 .... we've done massive empirical studies. We know that the RAID systems themselves are insanely reliable.
Cheap SAN like the P2000 .... we know that by collecting anecdotes, and knowing total sales figures, that the failure rates of those observed alone is too high for the entire existing set of products made, and we can safely assume that the number we have not observed is vastly higher. But observation alone tells us that the reliability is not high enough for any production use.We lost an entire virtualization platform and had to recover from scratch because the second member of a RAID 10 pair failed after replacing the first and a rebuild initiating. We'll stick with RAID 6 thanks.
EDIT: The on-site IT and I were well into our coffee chat when the spontaneous beep/beep happened and we were both, WTF?
See, that's an irrational, emotional reaction that we are trying to avoid. You have one anecdote that tells you nothing, but you make a decision based on it that goes against math and empirical studies. Why?
And even the anecdote doesn't tell you that RAID 6 would have protected you. Only that RAID 10 wasn't able to.
apparently he needed the 3 drive RAID 10 pairs that other guy was running.
-
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
Some examples of things we have math to tell us are good or bad...
RAID 10 .... we've done massive empirical studies. We know that the RAID systems themselves are insanely reliable.
Cheap SAN like the P2000 .... we know that by collecting anecdotes, and knowing total sales figures, that the failure rates of those observed alone is too high for the entire existing set of products made, and we can safely assume that the number we have not observed is vastly higher. But observation alone tells us that the reliability is not high enough for any production use.We lost an entire virtualization platform and had to recover from scratch because the second member of a RAID 10 pair failed after replacing the first and a rebuild initiating. We'll stick with RAID 6 thanks.
EDIT: The on-site IT and I were well into our coffee chat when the spontaneous beep/beep happened and we were both, WTF?
See, that's an irrational, emotional reaction that we are trying to avoid. You have one anecdote that tells you nothing, but you make a decision based on it that goes against math and empirical studies. Why?
The fact that you and possibly your org has actually studied things is important to the discussion.
We've had enough double disk failures over time to have influenced the decision to drop RAID 5. The RAID 10 failure was icing on the cake. Not an emotional reaction, just one that falls into what we've experienced failure wise across the board.
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
Some examples of things we have math to tell us are good or bad...
RAID 10 .... we've done massive empirical studies. We know that the RAID systems themselves are insanely reliable.
Cheap SAN like the P2000 .... we know that by collecting anecdotes, and knowing total sales figures, that the failure rates of those observed alone is too high for the entire existing set of products made, and we can safely assume that the number we have not observed is vastly higher. But observation alone tells us that the reliability is not high enough for any production use.We lost an entire virtualization platform and had to recover from scratch because the second member of a RAID 10 pair failed after replacing the first and a rebuild initiating. We'll stick with RAID 6 thanks.
EDIT: The on-site IT and I were well into our coffee chat when the spontaneous beep/beep happened and we were both, WTF?
See, that's an irrational, emotional reaction that we are trying to avoid. You have one anecdote that tells you nothing, but you make a decision based on it that goes against math and empirical studies. Why?
The fact that you and possibly your org has actually studied things is important to the discussion.
I've published about it and speak about it all the time. The study was massive. And took forever. As you can imagine.
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
The RAID 10 failure was icing on the cake. Not an emotional reaction, just one that falls into what we've experienced failure wise across the board.
What math did you use to make a single, very unusual RAID 10 failure lead you to something riskier?
How can it be non-emotional unless your discovery was that data loss simply didn't affect you and increasing risk was okay to save money on needing fewer disks?
-
With lots of double disk failures, the real thing you need to be looking at is the disks that you have or the environment that they are in. RAID 5 carries huge risk, but it shouldn't primarily be from double disk failures. That that is what led you away from RAID 5 should have been a red flag that something else was wrong. Double disk failure can happen to anyone, of course, but lots of them indicates a trend that isn't RAID related.
-
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
Some examples of things we have math to tell us are good or bad...
RAID 10 .... we've done massive empirical studies. We know that the RAID systems themselves are insanely reliable.
Cheap SAN like the P2000 .... we know that by collecting anecdotes, and knowing total sales figures, that the failure rates of those observed alone is too high for the entire existing set of products made, and we can safely assume that the number we have not observed is vastly higher. But observation alone tells us that the reliability is not high enough for any production use.We lost an entire virtualization platform and had to recover from scratch because the second member of a RAID 10 pair failed after replacing the first and a rebuild initiating. We'll stick with RAID 6 thanks.
EDIT: The on-site IT and I were well into our coffee chat when the spontaneous beep/beep happened and we were both, WTF?
See, that's an irrational, emotional reaction that we are trying to avoid. You have one anecdote that tells you nothing, but you make a decision based on it that goes against math and empirical studies. Why?
The fact that you and possibly your org has actually studied things is important to the discussion.
I've published about it and speak about it all the time. The study was massive. And took forever. As you can imagine.
One of the reasons we adopted Storage Spaces as a platform was because of the auto-retire and rebuild into free pool space via parallel rebuild. With, at that time, 2TB and larger drives becoming all the more common rebuild times on the RAID controller were taking a long time to happen.
Parallel rebuild into free pool space rebuilds that dead disk across all members in the pool. So, 2TB gets done in minutes instead of hours/days. Plus, once that dead drive's contents is rebuilt into free pool space so long as there is more free pool space to be had (size of disk + ~150GB) another disk failure can happen and still maintain the two disk resilience (Dual Parity or 3-Way Mirror).
RAID can't do that for us.
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
Some examples of things we have math to tell us are good or bad...
RAID 10 .... we've done massive empirical studies. We know that the RAID systems themselves are insanely reliable.
Cheap SAN like the P2000 .... we know that by collecting anecdotes, and knowing total sales figures, that the failure rates of those observed alone is too high for the entire existing set of products made, and we can safely assume that the number we have not observed is vastly higher. But observation alone tells us that the reliability is not high enough for any production use.We lost an entire virtualization platform and had to recover from scratch because the second member of a RAID 10 pair failed after replacing the first and a rebuild initiating. We'll stick with RAID 6 thanks.
EDIT: The on-site IT and I were well into our coffee chat when the spontaneous beep/beep happened and we were both, WTF?
See, that's an irrational, emotional reaction that we are trying to avoid. You have one anecdote that tells you nothing, but you make a decision based on it that goes against math and empirical studies. Why?
The fact that you and possibly your org has actually studied things is important to the discussion.
I've published about it and speak about it all the time. The study was massive. And took forever. As you can imagine.
One of the reasons we adopted Storage Spaces as a platform was because of the auto-retire and rebuild into free pool space via parallel rebuild. With, at that time, 2TB and larger drives becoming all the more common rebuild times on the RAID controller were taking a long time to happen.
RAID 1 / 10 rebuilds are generally... acceptable. But if you were choosing RAID 5 or 6, then the rebuilds are absurd. But we've known that for a long time. It just takes so much work to rebuild a system of that nature.
But this goes into the earlier discussion, if you were using math rather than emotions before moving to RAIN, it seems it would have kept you to RAID 1 or 10 all along.
-
@scottalanmiller said in Safe to have a 48TB Windows volume?:
With lots of double disk failures, the real thing you need to be looking at is the disks that you have or the environment that they are in. RAID 5 carries huge risk, but it shouldn't primarily be from double disk failures. That that is what led you away from RAID 5 should have been a red flag that something else was wrong. Double disk failure can happen to anyone, of course, but lots of them indicates a trend that isn't RAID related.
One was environment. The site had the HVAC above the ceiling tiles all messed up with primary paths not capped. So, air return did not work and A/C in the summer stayed above the ceiling tiles and heat in the winter as well. The server closet during the winter could easily hit 40C. There were no more circuits available anywhere in the leased space so we couldn't even get a portable A/C in there.
We experienced four, count them, four catastrophic failures at that site. The owners knew why but we were helpless against it. So, we build-out a highly available system using two servers, third party products, and a really good backup set (BUE failed us so we moved to Storagecraft ShadowProtect which has been flawless to date).
There's statistics. Then there's d*mned statistics.
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
Parallel rebuild into free pool space rebuilds that dead disk across all members in the pool. So, 2TB gets done in minutes instead of hours/days. Plus, once that dead drive's contents is rebuilt into free pool space so long as there is more free pool space to be had (size of disk + ~150GB) another disk failure can happen and still maintain the two disk resilience (Dual Parity or 3-Way Mirror).
RAID can't do that for us.
Absolutely, this is a huge reason why RAIN has been replacing RAID for a long time. We've had that for many years. Large capacity is making RAID simply ineffective, no surprises there. "Shuffling" data around as needed is a powerful tool.
Technically, RAID can do this, but does it very poorly. It's a feature of hybrid RAID.
-
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
Parallel rebuild into free pool space rebuilds that dead disk across all members in the pool. So, 2TB gets done in minutes instead of hours/days. Plus, once that dead drive's contents is rebuilt into free pool space so long as there is more free pool space to be had (size of disk + ~150GB) another disk failure can happen and still maintain the two disk resilience (Dual Parity or 3-Way Mirror).
RAID can't do that for us.
Absolutely, this is a huge reason why RAIN has been replacing RAID for a long time. We've had that for many years. Large capacity is making RAID simply ineffective, no surprises there. "Shuffling" data around as needed is a powerful tool.
Technically, RAID can do this, but does it very poorly. It's a feature of hybrid RAID.
We're seeing the same thing in Solid-State now too. As SSD vendors deliver larger and larger capacity devices the write speeds all of a sudden become a limiting factor. Go figure. :S
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
Parallel rebuild into free pool space rebuilds that dead disk across all members in the pool. So, 2TB gets done in minutes instead of hours/days. Plus, once that dead drive's contents is rebuilt into free pool space so long as there is more free pool space to be had (size of disk + ~150GB) another disk failure can happen and still maintain the two disk resilience (Dual Parity or 3-Way Mirror).
RAID can't do that for us.
Absolutely, this is a huge reason why RAIN has been replacing RAID for a long time. We've had that for many years. Large capacity is making RAID simply ineffective, no surprises there. "Shuffling" data around as needed is a powerful tool.
Technically, RAID can do this, but does it very poorly. It's a feature of hybrid RAID.
We're seeing the same thing in Solid-State now too. As SSD vendors deliver larger and larger capacity devices the write speeds all of a sudden become a limiting factor. Go figure. :S
Yes, RAID will unlikely ever make a large come back. The scale of storage in the future simply makes device-centric protection ineffecitve long term.
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
(BUE failed us so we moved to Storagecraft ShadowProtect which has been flawless to date).
What is BUE?
-
@FATeknollogee said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
(BUE failed us so we moved to Storagecraft ShadowProtect which has been flawless to date).
What is BUE?
BackUp Exec from Symantec
-
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
The RAID 10 failure was icing on the cake. Not an emotional reaction, just one that falls into what we've experienced failure wise across the board.
What math did you use to make a single, very unusual RAID 10 failure lead you to something riskier?
How can it be non-emotional unless your discovery was that data loss simply didn't affect you and increasing risk was okay to save money on needing fewer disks?
The statistics for double disk failures. The rebuild rates throw an extra amount of stress on the RAID 10's buddy. That places an extra amount of risk on the table. The number of times the same thing happened in a RAID 1 setting was also a factor.
-
@FATeknollogee said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
(BUE failed us so we moved to Storagecraft ShadowProtect which has been flawless to date).
What is BUE?
Sorry, I should have broken the acronym out. It's Backup Exec at one time by Colorado when it was an awesome product then Symantec when things went downhill from there.
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
The RAID 10 failure was icing on the cake. Not an emotional reaction, just one that falls into what we've experienced failure wise across the board.
What math did you use to make a single, very unusual RAID 10 failure lead you to something riskier?
How can it be non-emotional unless your discovery was that data loss simply didn't affect you and increasing risk was okay to save money on needing fewer disks?
The statistics for double disk failures. The rebuild rates throw an extra amount of stress on the RAID 10's buddy. That places an extra amount of risk on the table.
Yet we know that RAID 5/6 rebuild adds stress to every drive in the array instead of just a single drive.
-
@travisdh1 said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
The RAID 10 failure was icing on the cake. Not an emotional reaction, just one that falls into what we've experienced failure wise across the board.
What math did you use to make a single, very unusual RAID 10 failure lead you to something riskier?
How can it be non-emotional unless your discovery was that data loss simply didn't affect you and increasing risk was okay to save money on needing fewer disks?
The statistics for double disk failures. The rebuild rates throw an extra amount of stress on the RAID 10's buddy. That places an extra amount of risk on the table.
Yet we know that RAID 5/6 rebuild adds stress to every drive in the array instead of just a single drive.
Concur, but the stress is a lot more distributed.
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
@travisdh1 said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
The RAID 10 failure was icing on the cake. Not an emotional reaction, just one that falls into what we've experienced failure wise across the board.
What math did you use to make a single, very unusual RAID 10 failure lead you to something riskier?
How can it be non-emotional unless your discovery was that data loss simply didn't affect you and increasing risk was okay to save money on needing fewer disks?
The statistics for double disk failures. The rebuild rates throw an extra amount of stress on the RAID 10's buddy. That places an extra amount of risk on the table.
Yet we know that RAID 5/6 rebuild adds stress to every drive in the array instead of just a single drive.
Concur, but the stress is a lot more distributed.
I think you're missing my point. Rebuilding a RAID1 or 10 array stresses 1 drive. Rebuilding a RAID 5 or 6 array stresses every single drive in the array. Stressing 1 drive vs stressing X drives where we know X is always greater than 1.
-
@PhlipElder said in Safe to have a 48TB Windows volume?:
@scottalanmiller said in Safe to have a 48TB Windows volume?:
@PhlipElder said in Safe to have a 48TB Windows volume?:
The RAID 10 failure was icing on the cake. Not an emotional reaction, just one that falls into what we've experienced failure wise across the board.
What math did you use to make a single, very unusual RAID 10 failure lead you to something riskier?
How can it be non-emotional unless your discovery was that data loss simply didn't affect you and increasing risk was okay to save money on needing fewer disks?
The statistics for double disk failures. The rebuild rates throw an extra amount of stress on the RAID 10's buddy. That places an extra amount of risk on the table. The number of times the same thing happened in a RAID 1 setting was also a factor.
The stress is tiny, so small that the industry hasn't recognized it as a known stress. There has to be some, but it is very small.