How Does Local Storage Offer High Availability

dafyre

Maybe this is a good example:

Q: Is RAID 0 Redundant?

A: Actually, yes it is. It is redundant in the way that the term is used meaning that the array has "more than one disk." Does it have mirroring, parity, erasure encoding or any other means of protecting against disk failure? No, but the term redundant doesn't imply that. That's why the term RAID is still used, even when RAID 0 makes things far less reliable than if we did not have RAID at all.

This is where we will have to disagree. If you google "define redundancy" without quotes, the second definition is listed as:

the inclusion of extra components that are not strictly necessary to functioning, in case of failure in other components.

In case of using individual drives, RAID 0 is not redundant at all. Because the other drives are necessary for the RAID 0 to function. In the case of a single disk failure, RAID 0 becomes lost data.

I would suggest more to meet the definition of redundancy, RAID 1 would be a better suggestion. In RAID 1, a single disk failure becomes a need to replace the disk, but no lost data, and no down time (assuming hot swappable drives).
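To put rough numbers on the RAID 0 versus RAID 1 point, here is a minimal sketch. It assumes drive failures are independent and that every drive has the same failure probability `p` over some period; the value of `p` and the function names are made up for illustration only.

```python
def raid0_data_loss(p: float, n: int) -> float:
    """Chance a RAID 0 stripe of n drives loses data: any single drive failure is fatal."""
    return 1 - (1 - p) ** n


def raid1_data_loss(p: float, n: int) -> float:
    """Chance an n-way RAID 1 mirror loses data: every drive has to fail."""
    return p ** n


p = 0.05  # hypothetical per-drive failure probability, not a measured figure
for n in (1, 2, 4):
    print(f"{n} drive(s): RAID 0 loss {raid0_data_loss(p, n):.4f}, "
          f"RAID 1 loss {raid1_data_loss(p, n):.6f}")
```

Under those assumptions, adding drives to RAID 0 raises the chance of data loss, while adding drives to RAID 1 lowers it.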

@scottalanmiller said:

What if someone makes a single product that is more reliable than two redundant devices can equal?

At that point, it becomes show me a device that won't fail when it experiences a hardware problem. Then we're back to "It has two controllers."

Adding more components into the same chassis does not make it more reliable. It makes it a larger SPOF.

Once we start using redundancy as a proxy for reliability we start to be at risk of taking a current scenario (the server we have today is risky and needs a second node to be safe enough to use) and stop realizing when factors change. It's memorizing the solution pattern for a specific scenario and missing the factors that drive us to that pattern at the time.

I agree that we have to be careful to not recommend solutions by rote memorization. That's how a lot of folks wind up with RAID 5, instead of RAID 6 or OBR10. We also have to be careful, as you say, to get the best bang for our buck when building solutions. There are times when the business simply can't afford two servers for the project, and you have to make a good, reliable one and have good backups.

I get what you are saying about reliability. I think we are talking about two different types of reliability. You are speaking of a single device reliability ( a single server). I am thinking of perceived reliability -- the reliability of the whole system.

I have used these examples with you before. My SAN cluster appears to improve reliability at the top layer. In reality, one node blew out two drives last week, and since it was RAID 5, one node was down until we got new drives in it.

If we had been running on a single SAN device, then we would have been totally dead. However, since we had two that were fully replicated with automagic failover, nobody noticed a thing, and therefore our reliability appears to have increased because of the redundancy. The individual device reliability did not get better or worse, but it did have a failure.

No business cares that they have two servers, they care that things work and that money isn't wasted. That's all. If that is achieved with redundancy, that's fine. But if it is achieved in another way, that's fine too.

^ This, I can definitely agree with.

In all cases, without exception, it is the resulting reliability, cost and risk that determine what is a good solution.

Assuming you mean perceived reliability of the entire system (whether that be one server or ten), then I can agree with that.

Redundancy is commonly a part of a good solution, but never always. In many cases the cost of redundancy is higher than the risk it mitigates.

This is where the IT department has to understand the business and the end-goal that they have been mandated to achieve, and like I said before -- with the best bang for the buck.

scottalanmiller

@dafyre said:

This is where we will have to disagree. If you google "define redundancy" without quotes, the second definition is listed as:
the inclusion of extra components that are not strictly necessary to functioning, in case of failure in other components.

That's true, that's a definition of redundancy, but not the one in use. Since redundancy can mean that but does not necessarily, it doesn't really matter.

Even using the extra definition, it becomes difficult to assess when something taking over from another component still doesn't improve reliability. The result remains the same: redundancy is about having extra things, and doesn't necessarily improve the end goal of reliability.

Just look at the HP MSA SAN devices. They are truly redundant by both definitions. Yet their redundancy loses reliability. So regardless of definition, redundancy on its own is never a goal, it's only a means to an end and never a given.

Here is the Cambridge Dictionary's definition of redundant. Nothing suggesting what you found on Google:

redundant

scottalanmiller

@dafyre said:

In case of using individual drives, RAID 0 is not redundant at all. Because the other drives are necessary for the RAID 0 to function. In the case of a single disk failure, RAID 0 becomes lost data.

In the case of drives it IS redundancy. You only need one drive. Now you have two or more. It is the DATA that is not redundant in RAID 0. The drives are very redundant. That's the difference. RAID refers to the drives explicitly, not the data on the drives.

scottalanmiller

@dafyre said:

I would suggest more to meet the definition of redundancy, RAID 1 would be a better suggestion. In RAID 1, a single disk failure becomes a need to replace the disk, but no lost data, and no down time (assuming hot swappable drives).

That's defining failover, not redundancy. And you are referring to the data, not the drives. If I have a RAID 0 failure and one dies, I still have a working drive. It's redundant at the drive level by either definition of redundant.

scottalanmiller

@dafyre said:

What if someone makes a single product that is more reliable than two redundant devices can equal?

At that point, it becomes show me a device that won't fail when it experiences a hardware problem. Then we're back to "It has two controllers."

The point is that you are stuck on redundancy when it doesn't matter. You say "when it fails." But that's not relevant. What we care about is always the reliability of the whole system. And what we care about is that the whole system doesn't fail. A brick is less likely to fail than two pieces of wood; redundancy doesn't overcome the weakness of the wood.

The thing that changes here is the chance of the thing failing at all. Sure, IF it fails it would fail. But it's not likely to fail.

If one thing is less likely to fail than two things are to both fail, the single thing is more reliable.
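As a rough sketch of that comparison, assuming the two redundant devices fail independently and failover between them always works (both generous assumptions), with made-up failure probabilities for the "brick" and the "wood":

```python
def pair_failure(p_each: float) -> float:
    """Chance that both members of a redundant pair fail.

    Assumes independent failures and perfect failover.
    """
    return p_each ** 2


def single_is_more_reliable(p_single: float, p_each: float) -> bool:
    """True when one device is less likely to fail than the pair is to both fail."""
    return p_single < pair_failure(p_each)


brick = 0.001  # hypothetical failure probability of one highly reliable device
wood = 0.05    # hypothetical failure probability of each cheaper device

print("pair of wood fails with probability", pair_failure(wood))
print("single brick more reliable?", single_is_more_reliable(brick, wood))
```

With these illustrative numbers the single device wins. With a weaker single device (say 0.01) the pair would win, which is why the actual failure rates matter more than the device count.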

scottalanmiller

@dafyre said:

Adding more components into the same chassis does not make it more reliable. It makes it a larger SPOF.

SPOF is a very dangerous term because it drives an emotional response and gets people to look away from the full system reliability. Adding more components into a single chassis may make it a bigger SPOF, it might make it more fragile or it might make it more reliable. Look at an EMC VMAX or an IBM z/90. They are SPOFs and you've never met someone who's had one die, ever. They run for decades. They use single device reliability instead of multiple device failover to achieve reliability.

That something is a SPOF isn't a problem. The only thing that matters is resulting reliability.

dafyre

@scottalanmiller said:

Just look at the HP MSA SAN devices. They are truly redundant by both definitions. Yet their redundancy loses reliability. So regardless of definition, redundancy on its own is never a goal, it's only a means to an end and never a given.

But see, you are going for device reliability, I'm not arguing that. I agree that here, having redundancy does nothing to help the reliability of the individual devices.

I am speaking here to the appearance of reliability. The redundant system takes over when the main one fails, thus the system, overall, appears more reliable.

Here is the Cambridge Dictionary's definition of redundant. Nothing suggesting what you found on Google:

Dictionary.com -- http://dictionary.reference.com/browse/redundant?s=t (check out part d)

scottalanmiller

@dafyre said:

I get what you are saying about reliability. I think we are talking about two different types of reliability. You are speaking of a single device reliability ( a single server). I am thinking of perceived reliability -- the reliability of the whole system.

I'm talking about both. The reliabilities of the components of a system are the factors that lead to the resulting system reliability.

I can build a system with high availability with a single server or with a cluster of them. The one requires the individual server to be highly reliable itself, the other requires that multiple servers be able to fail over to one another. Two different approaches, each with their own challenges. One is not inherently better than the other.

The problems arise when people start to assume that servers have a fixed reliability and therefore lead to a relatively fixed system reliability. But this isn't remotely true. An HP MSA has a very high single device failure rate while an Oracle M5000 or IBM z/90 have insanely low failure rates. I'll take a single M5000 over a cluster of cheap crappy servers any day.

scottalanmiller

@dafyre said:

But see, you are going for device reliability, I'm not arguing that. I agree that here, having redundancy does nothing to help the reliability of the individual devices.

No, I'm not. I'm talking about the system. Redundancy is never the goal. Never. The final goal is always system reliability. That system reliability might be achieved through highly reliable individual devices (the brick) or failover of less reliable redundant devices (marshmallows).

I'm always talking about the resulting system reliability.

dafyre

@scottalanmiller said:

@dafyre said:

In case of using individual drives, RAID 0 is not redundant at all. Because the other drives are necessary for the RAID 0 to function. In the case of a single disk failure, RAID 0 becomes lost data.

In the case of drives it IS redundancy. You only need one drive. Now you have two or more. It is the DATA that is not redundant in RAID 0. The drives are very redundant. That's the difference. RAID refers to the drives explicitly, not the data on the drives.

I guess I can see your point about the drives being redundant, and the data is not... but why would you extricate the two?

scottalanmiller

@dafyre said:

Dictionary.com -- http://dictionary.reference.com/browse/redundant?s=t (check out part d)

Yes, there is an engineering definition of redundant that can mean that. But even using that definition, redundant does not mean what people think that it does. RAID 0 still has working drives even after the data is lost.

scottalanmiller

@dafyre said:

I guess I can see your point about the drives being redundant, and the data is not... but why would you extricate the two?

Because they are two different things and the term RAID is only referring to the drives. More importantly, why would someone assume the two were combined?

Why does anyone talk about redundancy instead of reliability? Who knows, but once people talk about redundancy as a proxy for reliability, bad things will happen.

scottalanmiller

@dafyre said:

I have used these examples with you before. My SAN cluster appears to improve reliability at the top layer. In reality, one node blew out two drives last week, and since it was RAID 5, one node was down until we got new drives in it.

If we had been running on a single SAN device, then we would have been totally dead. However, since we had two that were fully replicated with automagic failover, nobody noticed a thing, and therefore our reliability appears to have increased because of the redundancy. The individual device reliability did not get better or worse, but it did have a failure.

Well, in that example, the cluster is improving reliability over just having a single SAN node. But the SAN itself is lowering the reliability. The redundancy of the dual SAN nodes must be increasing the system reliability. The SAN itself is lowering it. Then there is an additional question, which we cannot answer here, of whether the redundant SAN (one positive, one negative) offsets having no SAN and no storage redundancy. Often it is just as good to do neither as to do both (but way cheaper).

But the issues in your example are that you have a lot of pieces all affecting real and perceived reliability. In that case at least some of the redundancy is good, some we don't know.

dafyre

@scottalanmiller said:

I'm always talking about the resulting system reliability.

Right.

So you have one server that is highly reliable. It has no unplanned down time until one day, the RAID controller in the system shorts out and takes out three hard drives. You spend 2 hours down, waiting on parts, and an additional 4 hours restoring from backups.

Now I have two servers that are plain reliable and replicated with failover, etc, etc. Neither of them has any unplanned down time until one day, the RAID controller burns out on one of the nodes. I spend 2 hours waiting on parts, and 1 hour re-installing my OS, and getting the system set back up for replication. I suffer from zero down time.

Which system is more reliable? Yours, of course. A single system would win at reliability.

Which one looks more reliable? (It's all about the appearance.) Mine would, because as far as my end-users are concerned the system (as a whole, all moving parts involved) did not go down.
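To make the "what the end users see" comparison concrete, here is a small sketch using the downtime figures from the two scenarios above. The one-year measurement window is my assumption, picked only so there is something to divide by.

```python
HOURS_PER_YEAR = 365 * 24  # assumed measurement window


def availability(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Fraction of the period during which users could reach the service."""
    return 1 - downtime_hours / period_hours


# Single server: 2 hours waiting on parts plus 4 hours restoring from backups.
single = availability(2 + 4)
# Two-node cluster: the repair happens behind the failover, so users see no outage.
cluster = availability(0)

print(f"single server: {single:.5%}")
print(f"cluster:       {cluster:.5%}")
```

This only measures the one incident described; it says nothing about how often each design would fail in the first place.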

Dashrender

@scottalanmiller said:

@BBigford said:

@travisdh1 I do remember that one but I thought I had read SAM say something else. Maybe I'm just crazy. I'm probably crazy. It could very well have been that though. That feature was completely misleading and criminal to even put on a feature sheet.

As long as they only called it redundant. Most IT people don't care about reliability, they just want redundancy. So why not sell it to them?

What good is redundancy if it's not reliable?

dafyre

@scottalanmiller said:

@dafyre said:

I guess I can see your point about the drives being redundant, and the data is not... but why would you extricate the two?

Because they are two different things and the term RAID is only referring to the drives. More importantly, why would someone assume the two were combined?

Why does anyone talk about redundancy instead of reliability? Who knows, but once people talk about redundancy as a proxy for reliability, bad things will happen.

I meant to say why would you extricate drives and the data... what good are the drives without data?

scottalanmiller

Using the dictionary definition of engineering redundancy, how would we define something like tightly coupled controllers?

Can they fail over and keep working? Yes, they can.

Are they likely to do so? No.

Is the additional risk introduced by having two objects to possibly fail and to likely cause their peer to fail offset by the possibility of failover sometimes? No.

So this is how I see it:

English Redundancy: Yes, there are two items.
Engineering Redundancy: Yes, they can failover.
Increased Reliability: No, the resulting system has become more fragile.

Even engineering redundancy can lead to fragility if we don't look at the system holistically.
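A toy model of the tightly coupled case, as a sketch only. It assumes each controller fails independently with probability `p`, and that when exactly one fails there is a probability `c` that it takes its peer down with it (or that failover otherwise does not happen). Both parameters and the function name are hypothetical, not vendor figures.

```python
def coupled_pair_failure(p: float, c: float) -> float:
    """Chance a coupled controller pair goes down.

    p: probability that a given controller fails on its own
    c: probability that a lone failure also takes out the peer
    """
    both_fail_independently = p * p
    one_fails_and_cascades = 2 * p * (1 - p) * c
    return both_fail_independently + one_fails_and_cascades


p = 0.05  # hypothetical single-controller failure probability
for c in (0.0, 0.5, 0.8):
    pair = coupled_pair_failure(p, c)
    verdict = "worse than" if pair > p else "no worse than"
    print(f"c={c}: pair fails with probability {pair:.4f}, {verdict} a single controller")
```

In this model the difference between the pair and a single controller works out to `p * (1 - p) * (2 * c - 1)`, so the pair is more fragile than a single controller whenever a lone failure is more likely than not to drag the peer down (`c` above 0.5), no matter how many controllers the spec sheet lists.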

dafyre

@Dashrender said:

@scottalanmiller said:

@BBigford said:

@travisdh1 I do remember that one but I thought I had read SAM say something else. Maybe I'm just crazy. I'm probably crazy. It could very well have been that though. That feature was completely misleading and criminal to even put on a feature sheet.

As long as they only called it redundant. Most IT people don't care about reliability, they just want redundancy. So why not sell it to them?

What good is redundancy if it's not reliable?

You won't know how reliable your redundancy is until you have something fail.

scottalanmiller

@dafyre said:

I meant to say why would you extricate drives and the data... what good are the drives without data?

What if their job is just to be a cache and they can keep working just fine with a reduced drive count? Drive and data are not the same thing. While they are assumed to be associated, and certainly often are, they are different things. We can't just merge them, we lose the ability to talk about them individually.

scottalanmiller

@Dashrender said:

What good is redundancy if it's not reliable?

It's no good. Which means it would be crazy to ever seek redundancy instead of reliability.