What % is normal availability?
-
@scottalanmiller Looking for some statistics on the subject of server reliability I found the data below.
While I don't know where the data originates, just looking at servers from 1 to 6 years of age we get an average of 3.27 hours of downtime per year. That translates into 99.96% availability.
It's a little less than what you estimated but still in the same ballpark, I think.
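For anyone checking the arithmetic, here's that conversion written out (a minimal sketch; the 3.27 hours is the chart's average for 1 to 6 year old servers):

```python
# Convert average annual downtime hours into an availability percentage.
HOURS_PER_YEAR = 365 * 24  # 8760

downtime_hours = 3.27  # chart average for servers 1 to 6 years old
availability = (1 - downtime_hours / HOURS_PER_YEAR) * 100

print(f"{availability:.2f}%")  # -> 99.96%
```
-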
@pete-s And it's a tough question, starting with "what does downtime mean?" Lots of shops CHOOSE to continue downtime of one service to avoid losing another, or refuse proactive maintenance to shift performance-impact windows, and so forth. Those decisions to intentionally increase risk or downtime should not be included, as they are not outages caused by the server.
I'm sure that the average downtime of servers is higher (as in worse) than the 99.99% number. But if we were to filter the numbers for "when kept in proper datacenters, with proper proactive monitoring and maintenance, and proper parts replacement practices", this total average suggests that availability must be higher than even 99.995%!
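For reference, the common "nines" translate to annual downtime allowances like this (plain arithmetic, nothing vendor-specific):

```python
# Annual downtime allowed at each availability level.
MINUTES_PER_YEAR = 365 * 24 * 60

for nines in (99.9, 99.95, 99.99, 99.995, 99.999):
    downtime_min = (1 - nines / 100) * MINUTES_PER_YEAR
    print(f"{nines}% -> {downtime_min:.1f} minutes/year")
```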
-
The same server, treated the same between two companies, will still have very different uptimes based on environmental conditions. A good datacenter will totally change the uptime.
The same server in the same environment, but with different workloads, disk configurations, etc. will have different uptimes.
The chart above suggests that people with RAID 0, no RAID, or RAID 5 on SATA disks; people without dual power supplies; people without proper air conditioning; and people without UPS are all lumped in with people using their servers "properly" under good conditions.
So if you take away those people and say "only enterprise-class servers, with dual power supplies, healthy UPS conditioning, stable HVAC with X temp and humidity variance, dust mitigation, RAID 10 on enterprise drives, etc." you get an amazingly different reliability number.
-
Now what is really useful from the chart is that somewhere between six and seven years the downtime risk appears to double on average. That's a REALLY useful number for risk analysis.
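As a sketch of how that doubling feeds into a risk analysis (the downtime figure is the chart average; the dollar cost per hour is purely hypothetical):

```python
# Expected annual downtime cost, before and after the ~year-six doubling.
downtime_young = 3.27              # average hours/year, servers 1 to 6 years old
downtime_old = downtime_young * 2  # risk roughly doubles past year six

cost_per_hour = 5_000              # hypothetical cost of one hour of downtime

print(f"Years 1-6: ${downtime_young * cost_per_hour:,.0f}/year expected")
print(f"Year 7+:   ${downtime_old * cost_per_hour:,.0f}/year expected")
```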
-
@scottalanmiller What you say makes sense.
Consider a virtual environment with a consolidation ratio of, say, 10 for the sake of simplicity.
Assume an average of 1 hour of yearly downtime on a physical server. The probability of hardware failure on the VM host is the same as for a single server, but every hour of downtime would be 1 x 10 = 10 hours of total downtime for the VMs.
Having 10 physical servers would mean 10 times the probability of hardware failure, so 10 x 1 = 10 hours of downtime.
The net effect of this would be that the average availability per server (virtual or physical) would be the same.
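Written out as code, under those assumptions (1 hour/year of downtime per physical server, 10:1 consolidation):

```python
# Per-workload downtime: one host with ten VMs vs. ten hosts with one VM each.
hours_down_per_server = 1.0  # assumed average yearly hardware downtime per box
vms = 10

# One host running ten VMs: one hour of host downtime hits all ten workloads.
consolidated_workload_hours = hours_down_per_server * vms  # 10.0

# Ten hosts with one VM each: ten chances at an outage, one workload each.
separate_workload_hours = vms * hours_down_per_server * 1  # 10.0

# Same total, so the same average downtime per workload either way.
print(consolidated_workload_hours / vms, separate_workload_hours / vms)  # 1.0 1.0
```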
-
I would have thought the average availability to be lower somehow.
Given how reliable the equipment really is, it makes me wonder who really needs high availability. Most people aren't hosting NASDAQ servers, exactly.
-
@pete-s said in What % is normal availability?:
The probability of hardware failure on the VM host is the same as for a single server, but every hour of downtime would be 1 x 10 = 10 hours of total downtime for the VMs.
Having 10 physical servers would mean 10 times the probability of hardware failure, so 10 x 1 = 10 hours of downtime.
That's true in a sense. But that's not how downtime is measured. The server is down, not the workloads. If you start breaking workloads up, there's no end to it. For example, what if each process were a workload? Each VM might have over a thousand processes. Yet the end result is still one server down for one hour.
Think of a car. How long was the car broken down? Two days. It doesn't become eight days, even if there were three passengers plus the driver.
Also, whether you take ten servers each with one VM, or one server with ten VMs, you magnify the risk in proportion to the number of workloads. So the average per workload remains the same with either approach.
In the real world, however, because of the interconnected nature of workloads, the workload risk of the single-server approach is tiny compared to the risk of the ten servers. Most real-world factors make the consolidated approach the safer one in almost all cases.
-
@pete-s said in What % is normal availability?:
Given how reliable the equipment really is, it makes me wonder who really needs high availability. Most people aren't hosting NASDAQ servers, exactly.
This is what I've been saying for years. A few things I keep pointing out to people...
- HA is about shaving SECONDS; everything you pay for HA buys only the tiniest amounts of time (see the quick math after this list).
- HA carries a lot of overhead and its own risks. Many HA systems induce more downtime than they protect against.
- Many HA approaches or solutions only claim uptimes lower than a standard standalone server, rather than greater!
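To put the first point in rough numbers (both availability levels and the cost figure are illustrative assumptions, not data from this thread):

```python
# What does HA actually buy? The gap between a solid standalone server
# and an HA setup that works as advertised, priced per minute saved.
MINUTES_PER_YEAR = 365 * 24 * 60

standalone = 99.99  # assumed four-nines standalone server
with_ha = 99.999    # assumed five nines, if the HA layer delivers

minutes_saved = (with_ha - standalone) / 100 * MINUTES_PER_YEAR
ha_yearly_cost = 20_000  # hypothetical yearly cost of the HA layer

print(f"{minutes_saved:.0f} minutes/year saved")             # ~47
print(f"${ha_yearly_cost / minutes_saved:,.0f} per minute")  # ~$423
```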
-
At about an hour a year, I would say four 9's is acceptable.
-
@bbigford said in What % is normal availability?:
At about an hour a year, I would say four 9's is acceptable.
Yeah, that's pretty good. Especially considering that that hour can fall at any given time. For a typical business that operates no more than 60 hours a week, the chance that the hour will fall during production hours is pretty low, and it is extremely low that it would fall completely within business hours.
For a really typical office environment, four nines of uptime from a typical single server works out to an average of something like 12 minutes of downtime a year during production hours. It gets pretty crazy for a typical server in a typical company.
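The rough math behind that, assuming the downtime lands uniformly at random across the week (the 40-hour week for a "typical office" is my assumption):

```python
# Expected production-hours downtime for a four-nines server, if the
# downtime falls uniformly at random across the week.
MINUTES_PER_YEAR = 365 * 24 * 60
HOURS_PER_WEEK = 7 * 24  # 168

annual_downtime_min = (1 - 0.9999) * MINUTES_PER_YEAR  # ~52.6 minutes

for business_hours in (40, 60):
    in_production = annual_downtime_min * business_hours / HOURS_PER_WEEK
    print(f"{business_hours}h week: ~{in_production:.1f} min/year in production")
```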
-
Also, never forget this is unscheduled downtime.
Some people like to forget that.
-
We have had our four Dell R630s for coming up on two years now. Excluding planned maintenance (for example, patching for the Intel vulnerability), the servers have not been unavailable (unplanned) once in that time. Over 99.99% for us.
Monthly reboots are performed on the physical servers, but these are planned, and VMs are migrated between hosts for application availability during those windows.
What is the normal % availability for a website/app, rather than server?
-
@jimmy9008 said in What % is normal availability?:
We have had our four Dell R630s for coming up on two years now. Excluding planned maintenance (for example, patching for the Intel vulnerability), the servers have not been unavailable (unplanned) once in that time. Over 99.99% for us.
In the world of anecdotes...
We had two 1999 Compaq Proliant 800 servers that ran NT4 SP6a. Both made it a DECADE without unplanned downtime. 100% uptime, for a decade! They were finally retired because... they were ridiculous by the time that they retired.
-
@scottalanmiller said in What % is normal availability?:
@jimmy9008 said in What % is normal availability?:
We have had our four Dell R630s for coming up on two years now. Excluding planned maintenance (for example, patching for the Intel vulnerability), the servers have not been unavailable (unplanned) once in that time. Over 99.99% for us.
In the world of anecdotes...
We had two 1999 Compaq Proliant 800 servers that ran NT4 SP6a. Both made it a DECADE without unplanned downtime. 100% uptime, for a decade! They were finally retired because... they were ridiculous by the time that they retired.
:face_with_tears_of_joy: But it shows that hardware can deliver high uptime with well-run maintenance, no need for complexity. Our downtime comes from the developers/app builders rather than the hardware.
-
@jimmy9008 said in What % is normal availability?:
@scottalanmiller said in What % is normal availability?:
@jimmy9008 said in What % is normal availability?:
We have had our four Dell R630s for coming up on two years now. Excluding planned maintenance (for example, patching for the Intel vulnerability), the servers have not been unavailable (unplanned) once in that time. Over 99.99% for us.
In the world of anecdotes...
We had two 1999 Compaq Proliant 800 servers that ran NT4 SP6a. Both made it a DECADE without unplanned downtime. 100% uptime, for a decade! They were finally retired because... they were ridiculous by the time that they retired.
:face_with_tears_of_joy: But it shows that hardware can deliver high uptime with well-run maintenance, no need for complexity. Our downtime comes from the developers/app builders rather than the hardware.
Oh gosh, yeah.
When I was in my massive environment (80K+ servers), our outages were caused by, in this order...
- Developers and their code.
- System admin mistakes.
- SAN
- Datacenter level failures
- Hardware failures in components that don't support redundancy at this class (essentially memory failures).
Things that never (during my decade tenure) caused outages...
- Power Supplies (due to redundancy)
- Fans (due to redundancy)
- Hard Drives (due to redundancy)
-
@scottalanmiller said in What % is normal availability?:
When I was in my massive environment (80K+ servers)
80K+ servers? That has to be something the size of PayPal or LinkedIn.
-
@pete-s said in What % is normal availability?:
@scottalanmiller said in What % is normal availability?:
When I was in my massive environment (80K+ servers)
80K+ servers? That has to be something the size of PayPal or LinkedIn.
Way bigger than those. Those would not come close.