What % is normal availability?
-
My belief, and what I've said to vendors like HPE without them saying otherwise, is that roughly four nines, 99.99% uptime, is what you should expect from a well-maintained, under-warranty enterprise server.
I use this reference point often. This is for the DL380 and R7x0 type boxes, the super mainline commodity enterprise 2U servers; most HPE and Dell servers are only nominal variations on these most common models. These, if you have good maintenance plans, whether handled by warranty or by keeping parts ready for replacement, have an expectation of only a few hours of downtime over a decade of operation.
At 99.99%, you get nearly an hour (about 52.6 minutes) of downtime each year. Over a decade, that's several hours. Typically we don't operate or calculate for a decade but for 4-8 years, which is a shorter span and falls during the hardware's more reliable years. If you call that four hours for each half decade, the believed outage averages for this hardware over that time span are actually lower than that.
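For anyone who wants to reproduce that math, here is a minimal sketch of the conversion (plain Python, assuming a 365.25-day year; the numbers are just the arithmetic behind the figures above):

```python
# Minimal sketch: convert an availability percentage into allowed downtime.
# Assumes a 365.25-day year; purely illustrative.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(availability_pct):
    """Minutes of unplanned downtime per year allowed at a given availability %."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for nines in (99.9, 99.99, 99.999):
    per_year = downtime_minutes_per_year(nines)
    print(f"{nines}%: {per_year:.1f} min/year, "
          f"{per_year * 5 / 60:.1f} h per half decade")
# 99.99% works out to roughly 52.6 min/year, or about 4.4 hours per half decade.
```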
Like most things in IT, there is no way to actually get industry-wide numbers on this; not even the vendors have a way to collect them, and the numbers are subjective. But based on what we normally accept as measurements, it suggests that normal enterprise servers in the commodity space are better than four nines, and high-end ones like IBM Power, Oracle SPARC, and HPE Integrity average five or even six nines.
This is the same number I always pull out when SAN vendors talk about how reliable they are - often while touting numbers lower than those of normal servers.
-
In a large enterprise datacenter where we got some actual stats on these things (tens of thousands of servers over decades of monitoring), with good environments, good (but still commodity) gear, and good datacenter support, we were definitely seeing closer to five nines on HPE DL380 and DL385 gear rather than four nines. We paid support premiums to make that possible, but it is definitely feasible with that hardware.
-
@scottalanmiller Looking for some statistics on the subject of server reliability I found the data below.
While I don't know where the data originates from, just looking at servers from 1 to 6 years of age we get an average of 3.27 hours of downtime per year. That translates into 99.96% availability.
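As a quick sanity check on that figure (assuming the 3.27 hours is a yearly average), the conversion looks like this:

```python
# Sanity check: annual hours of downtime -> availability percentage.
# Assumes the 3.27-hour figure is an average per year.
HOURS_PER_YEAR = 365.25 * 24

downtime_hours = 3.27
availability_pct = 100 * (1 - downtime_hours / HOURS_PER_YEAR)
print(f"{availability_pct:.2f}%")  # ~99.96%
```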
It's a little less than what you estimated, but still in the same ballpark I think.
-
@pete-s And it's a tough question because of "what does downtime mean?" Lots of shops CHOOSE to prolong the downtime of one service to avoid losing another, or refuse pro-active maintenance in order to shift performance-impact windows, and so forth. Those decisions to intentionally increase risk or downtime should not be included, as they are not outages caused by the server.
I'm sure that the overall average across all servers is worse than the 99.99% number. But if we were to filter the numbers for "kept in proper datacenters, with proper pro-active monitoring and maintenance, and proper parts replacement practices", this total average suggests that availability there must be higher than even 99.995%!
-
The same server, treated the same between two companies, will still have very different uptimes based on environmental conditions. A good datacenter will totally change the uptime.
The same server in the same environment, but with different workloads, disk configurations, etc. will have different uptimes.
The chart above suggests that people with RAID 0, no RAID, or RAID 5 on SATA disks, people without dual power supplies, people without proper air conditioning, and people without UPS are all lumped in with people using their servers "properly" under good conditions.
So if you take away those people and say "only enterprise-class servers, with dual power supplies, healthy UPS conditioning, stable HVAC with X temperature and humidity variance, dust mitigation, RAID 10 on enterprise drives, etc.", you get an amazingly different reliability number.
-
Now what is really useful from the chart is that somewhere between six and seven years the downtime risks appear to double on average. That's a REALLY useful number for risk analysis.
-
@scottalanmiller What you say makes sense.
Consider a virtual environment with a consolidation ratio of, say, 10 for the sake of simplicity.
Assume an average of 1 hour of yearly downtime on a physical server. The probability of hardware failure on the VM host is the same as for a single server, but every hour of downtime would be 1x10 = 10 hours of total downtime for the VMs.
Having 10 physical servers would mean 10 times the probability of a hardware failure, so 10x1 = 10 hours of downtime.
The net effect of this would be that the average availability per server (virtual or physical) would be the same.
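A tiny sketch of that arithmetic, using the same hypothetical numbers (1 hour of yearly downtime per physical box, consolidation ratio of 10):

```python
# Hypothetical comparison: one host running 10 VMs vs. ten hosts running 1 VM each.
# Assumes 1 hour of unplanned downtime per physical server per year.
downtime_per_server_hours = 1
vms_per_host = 10

# One consolidated host: a single outage takes down all 10 VMs at once.
consolidated_vm_hours_lost = 1 * downtime_per_server_hours * vms_per_host

# Ten separate hosts: ten chances of an outage, but only 1 VM affected each time.
separate_vm_hours_lost = 10 * downtime_per_server_hours * 1

print(consolidated_vm_hours_lost, separate_vm_hours_lost)  # 10 10
# Per workload the average is identical: 1 hour per VM per year either way.
```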
-
I would have thought the average availability to be lower somehow.
Given how reliable the equipment really is, it makes me wonder who really needs high availability. Most people aren't hosting NASDAQ servers exactly.
-
@pete-s said in What % is normal availability?:
The probability of hardware failure on the VM host is the same as for a single server, but every hour of downtime would be 1x10 = 10 hours of total downtime for the VMs.
Having 10 physical servers would mean 10 times the probability of a hardware failure, so 10x1 = 10 hours of downtime.
That's true in a sense. But that's not how downtime is measured. The server is down, not the workloads. If you start breaking things up by workload, there's no end to it. For example, what if each process were a workload? Each VM might have over a thousand processes. Yet the end result is one server down for one hour.
Think of a car. How long was the car broken down? Two days. It doesn't become eight days, even if there were three passengers plus the driver.
Also, whether you take ten servers each with one VM or one server with ten VMs, the risk is magnified in proportion to the number of workloads either way. So the average per workload remains the same with either approach.
In the real world, however, because of the interconnected nature of workloads, the workload risk of the single-server approach is tiny compared to the risk of the ten servers. Most real-world factors make the consolidated approach the safer one in almost all cases.
-
@pete-s said in What % is normal availability?:
Given how reliable the equipment really is, it makes me wonder who really needs high availability. Most people aren't hosting NASDAQ servers exactly.
This is what I've been saying for years. A few things I keep pointing out to people...
- HA is about shaving SECONDS; everything that you pay for HA buys back only the tiniest amounts of time (see the quick math after this list).
- HA carries a lot of overhead and its own risks. Many HA systems induce more downtime than they protect against.
- Many HA approaches or solutions only claim to approach uptimes lower than the standard standalone number, rather than greater!
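To put rough numbers on the "shaving seconds" point, here is an illustrative comparison; the 99.99% and 99.999% figures are assumptions for the sake of the example, not measurements:

```python
# Illustrative only: how much downtime a jump from four to five nines buys back.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_min(availability_pct):
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

standalone = downtime_min(99.99)   # assumed well-run single server
ha_target = downtime_min(99.999)   # assumed (optimistic) HA cluster target

print(f"standalone:  {standalone:.1f} min/year")  # ~52.6
print(f"HA target:   {ha_target:.1f} min/year")   # ~5.3
print(f"bought back: {(standalone - ha_target) / 12:.1f} min/month")  # ~3.9
```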
-
At about an hour a year, I would say four 9's is acceptable.
-
@bbigford said in What % is normal availability?:
At about an hour a year, I would say four 9's is acceptable.
Yeah, that's pretty good. Especially considering that that hour can fall at any given time. For a typical business that operates no more than 60 hours a week, the chance that that hour will fall during production hours is pretty low, and the chance that it falls entirely within business hours is extremely low.
For a really typical office environment, four nines uptime from a typical single server works out to an average of something like 12 minutes of downtime a year during production hours. It gets pretty crazy for a typical server in a typical company.
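Here's a rough sketch of that estimate, assuming unplanned downtime is equally likely at any hour of the week; the 40- and 60-hour work weeks are just illustrative assumptions:

```python
# Rough estimate: how much of the yearly four-nines downtime allowance lands
# during business hours, assuming outages are equally likely at any time.
MINUTES_PER_YEAR = 365.25 * 24 * 60
yearly_downtime_min = MINUTES_PER_YEAR * (1 - 99.99 / 100)  # ~52.6 minutes

for business_hours_per_week in (40, 60):
    in_hours_fraction = business_hours_per_week / (24 * 7)
    print(f"{business_hours_per_week} h/week: "
          f"{yearly_downtime_min * in_hours_fraction:.0f} min/year during production")
# 40 h/week: ~13 min/year; 60 h/week: ~19 min/year
```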
-
Also, never forget this is unscheduled downtime.
Some people like to forget that.
-
We have had our four Dell R630s coming up to two years now. Excluding planned maintenance (for example patching for the Intel vulnerability), the servers have not been unavailable (unplanned) once in that time. Over 99.99%+ for us.
Monthly reboots are performed on the physical servers, but these are planned, and VMs are migrated between hosts to maintain application availability during those windows.
What is the normal % availability for a website/app, rather than a server?
-
@jimmy9008 said in What % is normal availability?:
We have had our four Dell R630s coming up to two years now. Excluding planned maintenance (for example patching for the Intel vulnerability), the servers have not been unavailable (unplanned) once in that time. Over 99.99%+ for us.
In the world of anecdotes...
We had two 1999 Compaq ProLiant 800 servers that ran NT4 SP6a. Both made it a DECADE without unplanned downtime. 100% uptime, for a decade! They were finally retired because... they were simply ridiculous by that point.
-
@scottalanmiller said in What % is normal availability?:
@jimmy9008 said in What % is normal availability?:
We have had our four Dell R630s coming up to two years now. Excluding planned maintenance (for example patching for the Intel vulnerability), the servers have not been unavailable (unplanned) once in that time. Over 99.99%+ for us.
In the world of anecdotes...
We had two 1999 Compaq ProLiant 800 servers that ran NT4 SP6a. Both made it a DECADE without unplanned downtime. 100% uptime, for a decade! They were finally retired because... they were simply ridiculous by that point.
😂 But it shows that hardware can deliver high uptime with well-run maintenance, no need for complexity. Our downtime comes from the developers/app builders rather than the hardware.
-
@jimmy9008 said in What % is normal availability?:
@scottalanmiller said in What % is normal availability?:
@jimmy9008 said in What % is normal availability?:
We have had our four Dell R630s coming up to two years now. Excluding planned maintenance (for example patching for the Intel vulnerability), the servers have not been unavailable (unplanned) once in that time. Over 99.99%+ for us.
In the world of anecdotes...
We had two 1999 Compaq ProLiant 800 servers that ran NT4 SP6a. Both made it a DECADE without unplanned downtime. 100% uptime, for a decade! They were finally retired because... they were simply ridiculous by that point.
😂 But it shows that hardware can deliver high uptime with well-run maintenance, no need for complexity. Our downtime comes from the developers/app builders rather than the hardware.
Oh gosh, yeah.
When I was in my massive environment (80K+ servers), our outages were caused by, in this order...
- Developers and their code.
- System admin mistakes.
- SAN
- Datacenter level failures
- Hardware failures in components that lack redundancy at this class (essentially memory failures).
Things that never (during my decade tenure) caused outages...
- Power Supplies (due to redundancy)
- Fans (due to redundancy)
- Hard Drives (due to redundancy)
-
@scottalanmiller said in What % is normal availability?:
When I was in my massive environment (80K+ servers)
80K+ servers? That has to be something the size of PayPal or LinkedIn.
-
@pete-s said in What % is normal availability?:
@scottalanmiller said in What % is normal availability?:
When I was in my massive environment (80K+ servers)
80K+ servers? That has to be something the size of PayPal or LinkedIn.
Way bigger than those. Those would not come close.