Where to find "best practice" for any given IT scenario
-
@Carnival-Boy said:
OK, take two typical SMB servers, each with 12 x 300GB disks. One is configured with RAID 10 and one is configured with RAID 5.
One of the disks in each machine fails and is replaced. What is the probability in each case that the array will not rebuild successfully? Roughly speaking.
I can see Scott in the corner right now doing the math (or just posting a link to where he's already done the math). From what I recall, 3.3 TB has like a 30% chance of hitting a URE, AKA total failure of the array. At something around 12TB there is statistically a 100% chance of hitting a URE (OK it might actually be 99.99%)
-
@scottalanmiller said:
@Dashrender said:
@scottalanmiller said:
Best practice is to simply remove it from consideration to clarify the remaining choices.
This reminds me of Darren's talk at SpiceWorld - only give the CEO/CFO the choices that you approve. Never provide them one that you don't want, they'll always pick that one.
By providing it, you are presenting it as an option. Basically meaning you approved it. You can tell them your top choices, but if you include it in the list, it's approved to some degree or the conversation is confused.
What surprises me is how often IT people will present completely unreasonable options as options to management. If your car got a flat, would you offer to 1) fix the flat or 2) set the car on fire? No, you would not offer something ridiculous that isn't reasonable. But IT often does this to management.
The more likely scenario is that management will reject all provided solutions and ask why it can't be done cheaper. Of course it can be done cheaper, but with orders of magnitude more risk. What is the recommendation then?
-
Using RAID 6 or RAID 10 which are both safer than RAID 5.
-
@Dashrender said:
I can see Scott in the corner right now doing the math (or just posting a link to where he's already done the math).
Cool. Facts are important here. A failure probability of 0.001% is 100 times higher than 0.00001%, so on that grounds it is an order of magnitude less reliable. But both are such tiny numbers that they could be ignored. That's where 'slightly' more reliable would also apply.
To go back to your car analogy, a Fiat is (probably) an order of magnitude less reliable than a Honda, but both are so reliable that you wouldn't necessarily say buying a Honda is best practice. Importing a car from North Korea might, however, be considered bad practice.
-
@Carnival-Boy said:
OK, take two typical SMB servers, each with 12 x 300GB disks. One is configured with RAID 10 and one is configured with RAID 5.
One of the disks in each machine fails and is replaced. What is the probability in each case that the array will not rebuild successfully? Roughly speaking.
Can you even buy 300GB drives today?
Reliability should never be considered from a point of "already failed", that misses part of the big picture. A RAID 5 array is more likely to experience a drive failure than RAID 10 as a starting point. We need to think about the total reliability, not the reliability from a single scenario.
Imagine this question to demonstrate why this is important:
"Which is more likely to survive a front end collision of 20mph, a Volvo C70 or a Ford Pinto?" You'd say the Volvo C70, of course.
But that assumes both cars HAVE had that accident. What if that wasn't the whole scenario? Let's ask again...
"Which is more likely to injure its passengers, a Volvo C70 driving 50pmh on the highway or a Ford Pinto sitting idle in a garage?"
Suddenly the tables turn, because while one is more likely to survive an accident, the other is safer by avoiding the accident which is even more effective.
-
@Dashrender said:
@scottalanmiller said:
@Dashrender said:
@scottalanmiller said:
Best practice is to simply remove it from consideration to clarify the remaining choices.
This reminds me of Darren's talk at SpiceWorld - only give the CEO/CFO the choices that you approve. Never provide them one that you don't want, they'll always pick that one.
By providing it, you are presenting it as an option. Basically meaning you approved it. You can tell them your top choices, but if you include it in the list, it's approved to some degree or the conversation is confused.
What surprises me is how often IT people will present completely unreasonable options as options to management. If your car got a flat, would you offer to 1) fix the flat or 2) set the car on fire? No, you would not offer something ridiculous that isn't reasonable. But IT often does this to management.
The more likely scenario is that management will reject all provided solutions and ask why it can't be done cheaper. Of course it can be done cheaper, but with orders of magnitude more risk. What is the recommendation then?
You say that it cannot be done cheaper while meeting goals. Ask them what goal they want to drop to reduce cost.
-
@Dashrender said:
@Carnival-Boy said:
OK, take two typical SMB servers, each with 12 x 300GB disks. One is configured with RAID 10 and one is configured with RAID 5.
One of the disks in each machine fails and is replaced. What is the probability in each case that the array will not rebuild successfully? Roughly speaking.
I can see Scott in the corner right now doing the math (or just posting a link to where he's already done the math). From what I recall, 3.3 TB has like a 30% chance of hitting a URE, AKA total failure of the array. At something around 12TB there is statistically a 100% chance of hitting a URE (OK it might actually be 99.99%)
Not that risky on the small SAS drives that are implied. But still riskier.
-
@Carnival-Boy said:
@Dashrender said:
I can see Scott in the corner right now doing the math (or just posting a link to where he's already done the math).
Cool. Facts are important here. A failure probability of 0.001% is 100 times higher than 0.00001%, so on that grounds it is an order of magnitude less reliable. But both are such tiny numbers that they could be ignored. That's where 'slightly' more reliable would also apply.
Easy way to think of it is.... RAID 10 you should expect to go a lifetime without hearing about anyone who has ever had this issue. RAID 5 you should expect multiple complete failures in your career.
RAID 10 failure rates are less than 1 in 80,000 array years. RAID 5 is closer to 1 in 20.
There are so many factors that go into this from drives being more likely to fail, longer time for rebuilds, risk during rebuild, rebuild causing other drives to fail, risk of memory issues, etc.
-
Based on using the different RAID types, of course.
-
Trying to eyeball the math, at 3.3TB of usable data, that RAID 5 array would fail way over 50% of the time with consumer class drives (like Red Pro.) So enterprise drives (like RE) which are 10x more reliable in regards to URE we would expect rebuilt risk from URE alone to be 5% or higher.
That is a one in twenty chance that the RAID 5 array would lose all of its data. This does not take into account secondary drive failure risk which is pretty big as well.
I would not put a one in twenty or maybe one in ten chance of failure on the same playing field as "so reliable no study can measure it completely." RAID 10 failures at 80,000 array years was only the known healthy rate, all that is know is that it is more reliable than that. Zero failures at 80,000 array years!
-
OK, RAID 5 isn't best practice. That's a relatively easy one. Give me some more examples where the term "best practice" might apply. I'm not convinced the term is that meaningful.
I'm having an extension built on my house at the moment, and I hear the term used quite a bit by my builders. There's building regulations that are legally required and there's ones that are best practice. For example, a shaver point should be located at least 30cm from the sink. That's not a legal requirement, but it's best practice. Smoke detectors should be mains powered not battery powered. Again, that's best practice rather than a legal requirement. These practices are pretty formal though - either by the manufacturer, or by the building regulators. I don't see much equivalence in the IT industry (sadly, as it would be super useful).
-
Best Practice: If data is valuable enough to be stored, it should be backed up.
-
@Carnival-Boy said:
OK, RAID 5 isn't best practice. That's a relatively easy one.
Actually it is a hard one, while it is a well documented best practice among storage experts, the industry as a whole lacks that expertise and pushes it heavily.
-
It's an easy one for anyone who hangs around the same forums you do
-
Another best practice: virtualize every workload (unless it is impossible to do so)
-
@scottalanmiller said:
Another best practice: virtualize every workload (unless it is impossible to do so)
What are some workloads it would be impossible to virtualize? With the exception of real-time, ulta-low latency requirements, I cannot think of anything.
-
@dafyre said:
What are some workloads it would be impossible to virtualize? With the exception of real-time, ulta-low latency requirements, I cannot think of anything.
Those and ones with very specific hardware requirements either technically or politically. That's about it. It is rare enough that it is effective to just say "never".
-
Workloads that you can't get working virtualised for whatever reason. I couldn't get Hamachi to work virtualised. Googling suggested a common problem with Hamachi not liking the VMware network drivers or something.
I've virtualised our firewall. I wonder if there's an argument that says I shouldn't because it means I have a hypervisor on a public facing host. Maybe? I dunno, could that be a security risk? It's not something I'm going to lose any sleep over.
-
@Carnival-Boy said:
I've virtualised our firewall. I wonder if there's an argument that says I shouldn't because it means I have a hypervisor on a public facing host. Maybe? I dunno, could that be a security risk? It's not something I'm going to lose any sleep over.
You can virtualize that without exposing the hypervisor in any way.
-
That's what I figured. I suppose I was wondering about accidentally exposing the hypervisor through human error.