Why Anecdotes Fail
-
I hear people state anecdotes of system reliability on a regular basis as a demonstration that statistics are wrong. Of course this makes no sense: statistics show averages and norms, while also telling us that large numbers of people will experience higher or lower than expected reliability in any specific case. Stating any single anecdote will simply agree with the statistics, not refute them.
To demonstrate, I will provide my own anecdotal evidence of uptime. At NTG we installed production Compaq Proliant servers in 1999. These servers ran NT 4 and ran, in a chained dependency no less (one DB server and one app server), for a few months shy of ten years. Ten years! And during that time they failed exactly.... never. Not one system outage. Not only that, but they were both running RAID 5 (my head hangs in shame - not really, because in 1999 that totally made sense).
Those systems were replaced by newer HP Proliants with RAID 10 that have run in production, enterprise-class datacenters, and they themselves have experienced.... zero failures. On top of that, the new datacenters have never lost power or Internet access. We are now at a whopping fifteen years of continuous uptime. Fifteen years and unlimited nines! Quite literally, we are at 100% uptime!
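To put "unlimited nines" in perspective, here is a rough Python sketch (all downtime figures are invented for illustration) of how availability maps to nines, and why a chained DB-to-app dependency should, on paper, be less reliable than a single box:

import math

def availability(uptime_hours, downtime_hours):
    # Fraction of time the system was up.
    return uptime_hours / (uptime_hours + downtime_hours)

def nines(avail):
    # Convert availability to "nines": 0.999 -> 3.0.
    # Undefined at exactly 1.0 (log of zero), which is the point:
    # observed 100% uptime is luck, not a number you can design to.
    return -math.log10(1.0 - avail)

hours_per_year = 24 * 365

# Assumed: a good single server averages 8 hours of downtime a year.
single = availability(hours_per_year - 8, 8)   # ~0.99909, about 3 nines

# Chained dependency: DB and app must BOTH be up, so availabilities
# multiply and the chain is expected to be LESS reliable than either box.
chained = single * single                      # ~0.99817

print(f"single:  {single:.5f} ({nines(single):.2f} nines)")
print(f"chained: {chained:.5f} ({nines(chained):.2f} nines)")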
We have no SAN and no live failover (we have DR systems, but the failover is manual).
If I were to present this anecdotal evidence in the way that it is often portrayed, I would claim that running a single server provides 100% uptime. Of course that isn't true. Single servers are much more reliable than people give them credit for - that is very true. And Proliants are amazing servers - also true. But Proliants don't give unlimited uptime. We are lucky. Not insanely lucky, but decently lucky. We also treat our equipment well (at least the new stuff - the old gear was in hot, dusty, high-vibration environments!)
I don't claim that these systems are high availability (HA); of course they are not. They are "standard availability", or SA. They are middle of the road as far as architecture goes. They are the starting point. Given enough time, we'd have downtime. But judging purely from the anecdote, I could make ridiculous claims.
But I do not. I rarely mention these servers, because they have been so absurdly reliable. We never expected them to go this long without going down. But they haven't gone down.
So the next time someone brings out an anecdote about how doing something "risky" was a good idea because they got lucky, remember that I've been luckier and would never make such claims. We believe that we made good, solid business decisions around our risk and cost model: we know what downtime will cost when it comes, we know what we will do to get back up and running, and we accepted and financially absorbed that risk a decade ago. Now, every day, we reap the benefits of the time value of money.
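For anyone who wants to see what "accepting and financially absorbing the risk" looks like on paper, here is a minimal sketch of that kind of risk/cost comparison. Every number in it is an invented assumption, not an actual figure from any real environment:

# Expected-cost view of the downtime risk, with invented numbers.
outage_probability_per_year = 0.05   # assumed: 5% chance of a serious outage
expected_outage_hours = 8.0          # assumed: time for a manual DR recovery
cost_per_downtime_hour = 2_000       # assumed: lost revenue and productivity

ha_premium_per_year = 15_000         # assumed: annualized cost of SAN/HA gear

expected_downtime_cost = (outage_probability_per_year
                          * expected_outage_hours
                          * cost_per_downtime_hour)

print(f"Expected annual downtime cost: ${expected_downtime_cost:,.0f}")  # $800
print(f"Annual cost of HA mitigation:  ${ha_premium_per_year:,.0f}")     # $15,000
# Under these assumptions, accepting the risk and keeping the capital
# invested is the rational call: the time-value-of-money point above.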
-
I felt similarly regarding MS SBS 2003. It was a "huge risk" to have a single server that did so much, and I made it worse for myself (shooting myself in the foot) by speccing it at only 4GB of memory. But it ran, daily, with only one "failure", which was a virus, and I may have been able to resolve it without wiping the system. But I chose to wipe it as a precaution.
But that SBS box ran for nearly six straight years with very little day-to-day futzing. We did have downtime, but there wasn't anything I could do about the Windstream hub down the road being under five feet of water. We were running, just with no internet.
I still like the SBS model, though in my new office it doesn't work since we have nearly 300 people.
-
While others might disagree with me, there is nothing wrong with the SBS model. If you were doing it today in an environment that SBS was made for (fewer than 75 users), you'd probably have email running in its own VM (and this is where everyone starts yelling that I'm a fool and should be using hosted email/Office 365) and AD/file/print in a second VM.
Of course you have to take into account what Scott said about making "solid business decisions around the risk and cost model", which in most cases for a business of this size will only require a single server with a manual DR plan.
-
When I first considered buying a couple of SANs, on the advice of my vendor, I tried to find out some stats for Proliant reliability. But I couldn't find any. Do HP publish them? I asked around and no-one seemed to know. So I couldn't rely on stats even if I wanted to.
I love stats. I mean really, really love them. But getting hold of accurate ones is pretty tricky. So I rely on anecdotal evidence mostly, not through choice but through necessity.
I'm talking here about independent stats. I'm constantly bombarded with stats from vendors trying to sell me something. But in the same way @scottalanmiller would say never rely on the advice of a vendor, I'd say never rely on the stats of a vendor.
-
Anyway.... anecdotally, I've been responsible for servers for 3 SMEs over the last 15 years. In that time I've probably got through around 15 Proliant servers. The total downtime during that time is precisely zero. I've also got through hundreds of HP desktops and can count the number of failures on one hand. So my anecdotal experience is that hardware is incredibly reliable. The only things that generally fail are power supplies and hard drives, but this hasn't resulted in server downtime as those two items have redundancy. I've even run mission-critical server software on old re-allocated PCs, which isn't the wisest thing to do, but again, it has given me relatively little trouble.
So without any reliable statistics to tell me otherwise, I can only rely on my own experience, which is that Proliant downtime isn't a big problem for me. I couldn't justify the cost of buying two SANs purely to address a risk that I'd never personally experienced, even if that risk was real.
My personal experience is that software is far, far less reliable than hardware, so my tight budget tends to go towards making software more reliable, not hardware. I'd be interested to know how many people spend thousands on a SAN but then fail to patch their software in a timely manner. I bet it happens, and it's crazy, because patching is generally free and SANs are expensive.
If anyone has stats to disprove my theory then I'd love to see them!
-
@Carnival-Boy said:
When I first considered buying a couple of SANs, on the advice of my vendor, I tried to find out some stats for Proliant reliability. But I couldn't find any. Do HP publish them? I asked around and no-one seemed to know. So I couldn't rely on stats even if I wanted to.
I have an article coming about that. No one publishes stats for that stuff because no one else does. Not the SAN vendors, not the server vendors, no one. But what you can know is that almost no SAN vendor makes a server on par with a Proliant. So whatever the equal-level SAN is, it is going to be slightly less reliable than the equivalent Proliant. Just the nature of scale, quality, engineering, etc. But they are roughly identical.
-
@Carnival-Boy said:
I'm talking here about independent stats.
If you think about the billions of dollars that would have to be spent to do a study at a proper scale, and that the study would be too old to be useful by the time it was completed.... there is no way to have stats like that in IT.
Imagine if someone did a current-generation Proliant versus PowerEdge study. You'd need at least a thousand servers of each model you want to test, and at least ten years to gather the lifetime stats. At $20K per server that is a $20,000,000 investment per model, so $40,000,000 minimum for a single server-to-server comparison of a single configuration. And that is before operational costs (power, cooling, etc.) for many years. And when the study was done, it could tell us which servers we "should have" bought - right as we were retiring them.
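Spelled out as a quick calculation, using the same figures as above:

servers_per_model = 1_000
cost_per_server = 20_000     # $20K each
models_compared = 2          # e.g. one Proliant model vs one PowerEdge model

hardware_cost = servers_per_model * cost_per_server * models_compared
print(f"Hardware alone: ${hardware_cost:,}")   # $40,000,000

# And that only buys lifetime stats for ONE configuration of each model,
# delivered roughly ten years later, after the hardware is obsolete.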
-
I've done informal studies in massive environments (tens of thousands of servers) and you are correct: hardware almost never fails. The drive to make your hardware redundant is misdirected in most cases. It's chasing a problem that does exist, but at huge cost, and quite often the complexity of the solution causes twice as much downtime as it theoretically protects against.
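A toy model of that effect, with assumed numbers only and no measurements behind them:

single_avail = 0.9995       # assumed availability of one good server

# Naive parallel redundancy: down only if BOTH nodes are down at once.
naive_pair = 1 - (1 - single_avail) ** 2        # ~0.99999975

# But the failover machinery (SAN, cluster software, replication) is itself
# a component in series that can take the whole system down.
failover_layer_avail = 0.999                    # assumed

realistic_pair = naive_pair * failover_layer_avail   # ~0.99900

print(f"single server:  {single_avail:.6f}")
print(f"naive pair:     {naive_pair:.8f}")
print(f"realistic pair: {realistic_pair:.6f}")  # worse than the single server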
-
@scottalanmiller said:
But what you can know is that almost no SAN vendor makes a server on par with a Proliant.
IIRC some HP SANs are or were Proliants. I recall the HP 4300 was basically a Proliant. When we were looking at getting a pair, we were basically told our existing Proliant was at risk of failure and in order to mitigate this risk we needed to effectively replace it with two Proliants, plus some software to keep the two in sync. It's adding redundancy to something that I've never personally had fail.
But like the majority of SMEs, we have no redundancy at the software level. We're running single databases for our ERP system and for our Exchange system, for example. So if the database fails we're down. Having a SAN would just mean the failure occurs across two pieces of hardware instead of one.
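That point can be put in simple availability terms. A hedged sketch with assumed figures: redundant hardware only raises the hardware term, while a single, non-redundant database caps the whole stack:

hw_single = 0.9995                        # assumed single-server availability
hw_redundant = 1 - (1 - hw_single) ** 2   # assumed synced pair (the SAN pitch)
sw_single_db = 0.998                      # assumed: one non-redundant database

stack_before = hw_single * sw_single_db
stack_after = hw_redundant * sw_single_db

print(f"single box + single DB: {stack_before:.6f}")   # ~0.997501
print(f"HA pair + single DB:    {stack_after:.6f}")    # ~0.998000
# The software term dominates: the expensive hardware redundancy barely
# moves the total, because if the database fails we're down either way.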
Another point to make about redundancy. I am really, really confident about the ability of my Proliants to handle disk failure. I've had quite a few over the years, and am now pretty relaxed about the process. That little red light comes on, I phone HP, a new drive arrives, I pop out the old drive, I pop in the new drive, the lights flash, and I walk away. Completely confident that the array will rebuild. It still makes me nervous, but it's a controlled nervousness. I doubt having a SAN fail is anywhere near as straightforward. My point being that I like simple redundancy, I dislike complex redundancy.
-
@Carnival-Boy said:
When I first considered buying a couple of SANs, on the advice of my vendor, I tried to find out some stats for Proliant reliability. But I couldn't find any. Do HP publish them? I asked around and no-one seemed to know. So I couldn't rely on stats even if I wanted to.
I love stats. I mean really, really love them. But getting hold of accurate ones is pretty tricky. So I rely on anecdotal evidence mostly, not through choice but through necessity.
I'm talking here about independent stats. I'm constantly bombarded with stats from vendors trying to sell me something. But in the same way @scottalanmiller would say never rely on the advice of a vendor, I'd say never rely on the stats of a vendor.
Just to be clear, I don't trust vendor stats either. Get nothing that promotes sales from a vendor.
-
@Carnival-Boy said:
@scottalanmiller said:
But what you can know is that almost no SAN vendor makes a server on par with a Proliant.
IIRC some HP SANs are or were Proliants. I recall the HP 4300 was basically a Proliant. When we were looking at getting a pair, we were basically told our existing Proliant was at risk of failure and in order to mitigate this risk we needed to effectively replace it with two Proliants, plus some software to keep the two in sync. It's adding redundancy to something that I've never personally had fail.
But like the majority of SMEs, we have no redundancy at the software level. We're running single databases for our ERP system and for our Exchange system, for example. So if the database fails we're down. Having a SAN would just mean the failure occurs across two pieces of hardware instead of one.
Another point to make about redundancy. I am really, really confident about the ability of my Proliants to handle disk failure. I've had quite a few over the years, and am now pretty relaxed about the process. That little red light comes on, I phone HP, a new drive arrives, I pop out the old drive, I pop in the new drive, the lights flash, and I walk away. Completely confident that the array will rebuild. It still makes me nervous, but it's a controlled nervousness. I doubt having a SAN fail is anywhere near as straightforward. My point being that I like simple redundancy, I dislike complex redundancy.
Exactly.
Many HP low-end SANs are in fact Proliants, often set up by DotHill and not by HP.