Redundancy is Never a Goal, Reliability is a Goal, Redundancy is a Tool

scottalanmiller

The title pretty much says it all. Too often in IT, nearly always in fact, I hear technical professionals and sometimes even business people talk about redundancy where it makes no sense. Not that having redundancy is itself bad for them, but they talk about it instead of talking about the thing that matters: reliability.

Redundancy simply refers to having more than one of something. As a business, at the goal level, I never care about this. I don't care if I have one drive or two in my servers, I only care that I don't lose data however that is achieved. I don't care if my airplane has one pilot or two, as long as it doesn't crash. If we had a magic way to make data safe without RAID or an airplane safer without pilots that would be better, not worse.

But too often in IT we find people looking for redundancy for its own sake, as if buying two of things is a goal. But it never is. Never. Redundancy is only ever a tool used in the hopes of gaining reliability. Often there are ways to obtain reliability through other means and often the cost of redundancy is greater than the value of the reliability that it provides. This is why companies often opt not to have redundant data centers, ISPs, or even servers in many cases - the cost outweighs the value.
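
The cost-versus-value comparison above can be sketched in a few lines of Python. This is only a rough illustration: the function names, the failure probabilities and the dollar figures are all made-up assumptions, and real risk analysis has far more inputs than a single probability and a single outage cost.

```python
def expected_annual_loss(outage_probability: float, outage_cost: float) -> float:
    """Expected yearly loss: chance of an outage times what it would cost."""
    return outage_probability * outage_cost


def redundancy_pays_off(
    p_without: float,              # assumed yearly outage probability without redundancy
    p_with: float,                 # assumed yearly outage probability with redundancy
    outage_cost: float,            # assumed cost of one outage
    redundancy_annual_cost: float  # assumed yearly cost of the redundant gear
) -> bool:
    """True only if the risk removed is worth more than the redundancy costs."""
    saved = (expected_annual_loss(p_without, outage_cost)
             - expected_annual_loss(p_with, outage_cost))
    return saved > redundancy_annual_cost


if __name__ == "__main__":
    # Hypothetical small shop: an outage costs 20,000, redundancy costs 3,000 a year.
    print(redundancy_pays_off(0.05, 0.01, 20_000, 3_000))
    # Hypothetical larger shop: same gear, but an outage costs 500,000.
    print(redundancy_pays_off(0.05, 0.01, 500_000, 3_000))
```

With these invented numbers the same redundancy is a poor buy for the first business and a good one for the second, which is the point: the decision comes from the reliability gained versus the money spent, not from the redundancy itself.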

When we discuss our needs or the importance of reliability in decisions, it is critical that we actually talk about reliability. There is a reason why reliability is expressed in terms of resultant value, like getting five nines of uptime, while redundancy is only a checkbox regardless of how good (triple mirrored RAID 1 or active/active EMC SAN controllers) or how bad (RAID 5 on old SATA disks or tightly dependent SAN controllers) that redundancy is for us. In some cases, like large RAID 5 sets or low end dual controller SANs, redundancy often actually lowers reliability in the real world rather than raising it! But many IT professionals, and even some business professionals, get confused and talk about redundancy instead of reliability. They fail to understand why these solutions can be not just bad but really, really bad, because they have lost sight of why they were implementing redundancy in the first place: they thought that it would provide enough of an improvement in reliability to offset the cost of the implementation and any additional risks that it would itself introduce.
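
To make "five nines" and the good-versus-bad redundancy contrast concrete, here is a minimal Python sketch using the textbook series and parallel availability formulas. The function names are hypothetical, the 99% component figure is invented, and the formulas assume failures are independent. Tightly coupled components do not meet that assumption, so real dual controllers can land much closer to the series number than the parallel one.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600, ignoring leap years


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime per year implied by an availability figure."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def series(*availabilities: float) -> float:
    """Availability when every component must work (a dependency chain)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities: float) -> float:
    """Availability when any one component is enough (idealized redundancy)."""
    all_fail = 1.0
    for a in availabilities:
        all_fail *= (1.0 - a)
    return 1.0 - all_fail


if __name__ == "__main__":
    print(downtime_minutes_per_year(0.99999))  # five nines
    print(parallel(0.99, 0.99))  # two independent 99% components, either suffices
    print(series(0.99, 0.99))    # two 99% components that both have to work
```

Five nines works out to a little over five minutes of downtime a year. Two 99% components that are truly independent come out better than one, but two that depend on each other come out worse than a single component, even though both designs tick the "redundant" checkbox.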

The reality is that understanding reliability is challenging. Many IT professionals, and sadly even many business professionals, are not trained in risk and struggle to evaluate the reliability of two different solutions, products, designs or techniques. Because reliability is hard to understand and convey, it has become a simple "out" to use the term redundancy as a proxy for reliability. Nearly everyone understands that reliability is the goal, and they just assume that redundant is good and non-redundant is bad, but that simply isn't the case. All redundancy needs to offset its cost to have any hope of being valuable, and that means that only the types of redundancy that improve reliability rather than hurt it even have a chance of being worth considering.

Redundancy is a crucial tool in the toolbox of any IT professional or engineer, one that we use almost daily and think about constantly. But never lose sight of why we use it. Redundancy itself is never what matters; only what redundancy might provide us matters. Not all redundancy is created equal and not all redundancy, even when it improves reliability, has value. Redundancy should never be treated as a checkbox.

Keep your eyes on the goal, which is always reliability, never redundancy.

AVI-NetworkGuy

It's stuff like this that gives IT a bad name, with it incorrectly being labeled a "cost center". Like you said, reliability is the goal, and throwing money away on solutions that not only don't help but could actively HURT the goal of reliability perpetuates these financial stigmas around IT. Thanks for this!!

art_of_shred

I'm going to throw a wrench in here sideways. Reliability is definitely key, but isn't redundancy meant to ensure continuity and not reliability? You want the most reliable of pilots flying your plane, that show up every day and do the very best job of piloting possible. But, if one should suddenly have a heart attack, I really want to know that there is redundancy (co-pilot) to ensure that the plane stays in the sky. Just getting 2 pilots from wherever, to ensure that the plane can stay in the sky isn't the best idea, but I'm still not entirely comfy with thinking that a really good pilot negates, or even reduces, the need for a co-pilot.

scottalanmiller

@art_of_shred said:

I'm going to throw a wrench in here sideways. Reliability is definitely key, but isn't redundancy meant to ensure continuity and not reliability?

System level reliability would be more or less the same thing as continuity. If you are reliably working, that means you have continuity.

Dashrender

In the case of the plane example - you need to have reliability in the pilots, and one of the ways to get that is by having a redundant pilot.

dafyre

@scottalanmiller When you say "System level" -- I assume you mean across the entire system... be that one server or ten.

Am I right?

Dashrender

When you ask about continuity you have to ask about uptime: what is your uptime requirement? Let's assume you're using a Unitrends appliance and you can afford 15 minutes of downtime. Perhaps instead of buying two VM hosts with replicated data (or shared SAN/NAS, etc.) you decide that you can spin up the image from the Unitrends box. That's not redundancy in my book, but it meets your reliability objective.

scottalanmiller

@art_of_shred said:

But, if one should suddenly have a heart attack, I really want to know that there is redundancy (co-pilot) to ensure that the plane stays in the sky.

Is that really what you care about? I don't. I care that everyone survives unhurt. I don't care how it happens. If it is handled through having a co-pilot, that's fine. If it is handled by having the plane land itself, that's fine. If it is a magical hand that goes up in the sky and pulls the plane to safety, that's fine. Sure, a co-pilot is probably the easiest way to address this given current technology, but it's not a goal, it's a tool. If a better way exists, I expect them to use that. It's the reliability of the flight system (measured in reliably safe flights) that matters to me. If they achieve that with pilots, monkeys, computers, hamsters, magic... I don't care.

scottalanmiller

@dafyre said:

@scottalanmiller When you say "System level" -- I assume you mean across the entire system... be that one server or ten.

Am I right?

Depends on the level that matters. The simplest way that most people think of it is "delivering a service to the end users" or "continuation of functionality". So if users need email, it would be the ability to deliver email services to customers that would be measured. Of course at the business level it's "ability to communicate" not "gets email", but the interface between the business' theoretical needs and the concrete implementation has to happen somewhere so that it is practical, defined and measurable. So once email is the agreed upon service, you would measure at the "availability of email" level to know what reliability the IT department is delivering.

Whether servers fail, storage fails, ISPs go down, buildings burn, etc. doesn't matter to the system as long as email continues to work, however that happens. Basically, look at the ends, not the means.
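
As a rough sketch of what measuring at the "availability of email" level could look like, the Python below assumes a hypothetical monitor that records, at regular intervals, one true/false answer to "could a user send and receive mail just now?". The function names, the once-a-minute probe interval and the 99.9% target are illustrative assumptions, not anything prescribed in this thread.

```python
from typing import Sequence


def measured_availability(probe_results: Sequence[bool]) -> float:
    """Fraction of service-level probes in which email actually worked."""
    if not probe_results:
        raise ValueError("need at least one probe result")
    return sum(1 for ok in probe_results if ok) / len(probe_results)


def meets_target(probe_results: Sequence[bool], target: float) -> bool:
    """Compare what the users experienced against the agreed target."""
    return measured_availability(probe_results) >= target


if __name__ == "__main__":
    # One hypothetical day of once-a-minute probes with three failed minutes.
    day = [True] * 1437 + [False] * 3
    print(measured_availability(day))
    print(meets_target(day, 0.999))
```

Nothing in the measurement knows or cares whether a server, a disk or an ISP was behind the three bad minutes, or how many spares were standing by. It only records whether email worked.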

scottalanmiller

@art_of_shred said:

....but I'm still not entirely comfy with thinking that a really good pilot negates, or even reduces, the need for a co-pilot.

The reason that you feel this way, and you are correct, is because you are already looking "under the hood" of the system and looking at "how" to make a system reliable yourself rather than looking at the reliability of the system itself.

Let's take this into IT. A pilot is like a hard drive and an airplane is like a server. Would we implement a server without a minimum of RAID 1 redundancy to protect our storage? Nope, of course not. However, that's under the hood and something that we know is a practical, everyday means of accomplishing reliability for this particular scenario - it's an implementation pattern. That's fine. That redundancy is never a goal does not suggest that it is not the most common tool to achieve the goal, just that it is really important not to mistake it for the goal. In the same way, two pilots is the most common pattern for achieving airplane reliability. There are other ways to do both of these things, but these two are such well established, effective patterns that it is insanely uncommon to deviate from them. Over time, though, new technologies or ideas might come forth that make these obsolete, and other approaches might be safer, cheaper or easier ways to get better reliability. At that point we should instantly change how we do things, because double hard drives or double pilots isn't a goal, it is simply a means to an end, with the end being reliability or continuity or safety, however you want to look at that.

scottalanmiller

@Dashrender said:

In the case of the plane example - you need to have reliability in the pilots, and one of the ways to get that is by having a redundant pilot.

You can think of it as pilot level reliability, but that is based around the assumption that pilots are necessary. It is handy for engineers to determine that pilots are necessary for plane level reliability and then focus on the pilot risks, but overall it is plane reliability, not pilot reliability, that the people riding in the plane, the investors in the airline and the government agencies auditing flight data care about. They want people getting up and down safely; if that can happen without pilots, great. It can't today, but employing pilots isn't the goal, it's just the best means to that goal currently.

Dashrender

@scottalanmiller said:

@Dashrender said:

In the case of the plane example - you need to have reliability in the pilots, and one of the ways to get that is by having a redundant pilot.

You can think of it as pilot level reliability, but that is based around the assumption that pilots are necessary. It is handy for engineers to determine that pilots are necessary for plane level reliability and then focus on the pilot risks, but overall it is plane reliability, not pilot reliability, that the people riding in the plane, the investors in the airline and the government agencies auditing flight data care about. They want people getting up and down safely; if that can happen without pilots, great. It can't today, but employing pilots isn't the goal, it's just the best means to that goal currently.

Yeah, the pilots are like your hard drive example.

scottalanmiller

Exactly. We all know that using hard drives for storage is practical and having them in RAID is the only practical way to make them safe in normal servers. Simple, proven, effective. But what if someone invented storage that was more reliable than hard drives in RAID 1 without needing redundancy of drives? Would we still use RAID 1 with hard drives? Of course not, redundancy isn't the goal, protecting the data and maintaining uptime are.

Or what if new techniques like RAIN come out that work for our servers? Say we have a bigger cluster, like a two node Hyper-V cluster with StarWind or a Scale cluster. Then suddenly RAID might not make sense, because the "system" is bigger and RAID is no longer the best way to handle it.

dafyre

@scottalanmiller That is where keeping your IT skillset and knowledge up to date comes into play. At my last job, I got sent to 1 training seminar over the course of ten years... My current IT knowledge was all built around things that I did at home, in my own time, when I had any to speak of.

Just because the old way of doing something has a newer counterpart doesn't mean the old way is necessarily shoved to the wayside right away.

Dashrender

@dafyre said:

Just because the old way of doing something has a newer counterpart doesn't mean the old way is necessarily shoved to the wayside right away.

Actually I think Scott said exactly the opposite. Using the hard drive example: today you do RAID 1 as a minimum, but if a new tech came out tomorrow that didn't require two drives (and assuming cost was the same or less) you should be moving to that new tech for new projects.

scottalanmiller

Both are true. @dafyre is correct that new doesn't always mean better. And @Dashrender is right that better supersedes "always used it" in value. But what I was saying is simply that we change based on the results and don't really care about new, traditional or any other under the hood artifact.

dafyre

@scottalanmiller said:

Both are true. @dafyre is correct that new doesn't always mean better. And @Dashrender is right that better supersedes "always used it" in value. But what I was saying is simply that we change based on the results and don't really care about new, traditional or any other under the hood artifact.

I generally don't want new methods when it comes to building a server... I want what is tried and true. If somebody comes up with... say... a new FS that just magically saves all your data on all the computers at your employer (Oh wait... somebody look up Aetherstore!)... I don't want to jump to that right away... We'll let somebody else do the testing on it... and after it has proven itself, then at our next server rollout, we can talk about it.

scottalanmiller

Just realized that this topic was actually missing the tags! Ugh, no wonder it rarely comes up in searches. Fixed, finally.