How Does Local Storage Offer High Availability

scottalanmiller

@dafyre said:

I guess I can see your point about the drives being redundant, and the data is not... but why would you extricate the two?

Because they are two different things and the term RAID is only referring to the drives. More importantly, why would someone assume the two were combined?

Why does anyone talk about redundancy instead of reliability? Who knows, but once people talk about redundancy as a proxy for reliability, bad things will happen.

scottalanmiller

@dafyre said:

I have used these examples with you before. My SAN cluster appears to improve reliability at the top layer. In reality, one node blew out two drives last week, and since it was RAID 5, one node was down until we got new drives in it.

If we had been running on a single SAN device, then we would have been totally dead. However, since we had two that were fully replicated with automagic failover, nobody noticed a thing, and therefore our reliability appears to have increased because of the redundancy. The individual device reliability did not get better or worse, but it did have a failure.

Well, in that example, the cluster is improving reliability over just having a single SAN node. But the SAN itself is lowering the reliability. So the redundancy of the dual SAN nodes must be increasing the system reliability, while the SAN itself is lowering it. Then there is an additional question that we can't answer from here: whether the redundant SAN (one positive, one negative) comes out ahead of having no SAN and no storage redundancy. Often doing neither is just as good as doing both (and way cheaper).

But the issues in your example are that you have a lot of pieces all affecting real and perceived reliability. In that case at least some of the redundancy is good, some we don't know.
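
A rough sketch of that arithmetic, assuming independent failures, perfect failover, and made-up availability figures. The 0.999 values and all names below are illustrative assumptions, not numbers from this thread, and the server figure is assumed to already include its local storage.

```python
def series(*availabilities):
    """Availability of a chain where every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """Availability of redundant components where any one being up is enough.

    Assumes failures are independent and failover always works,
    which is optimistic for real hardware.
    """
    all_down = 1.0
    for a in availabilities:
        all_down *= 1.0 - a
    return 1.0 - all_down


# Hypothetical availabilities, chosen only for illustration.
SERVER = 0.999
SAN_NODE = 0.999

local_only = SERVER                                      # no SAN at all
single_san = series(SERVER, SAN_NODE)                    # server depends on one SAN
dual_san = series(SERVER, parallel(SAN_NODE, SAN_NODE))  # server depends on a replicated pair

for name, value in [("local only", local_only),
                    ("single SAN", single_san),
                    ("dual SAN", dual_san)]:
    print(f"{name}: {value:.6f}")
```

Under these assumptions the single SAN lowers availability and the second node wins most of it back, but the pair only approaches the local-only figure. Real results depend on how independent the failures actually are.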

dafyre

@scottalanmiller said:

I'm always talking about the resulting system reliability.

Right.

So you have one server that is highly reliable. It has no unplanned down time until one day, the RAID controller in the system shorts out and takes out three hard drives. You spend 2 hours down, waiting on parts, and an additional 4 hours restoring from backups.

Now I have two servers that are plain reliable and replicated with failover, etc, etc. Neither of them has any unplanned down time until one day, the RAID controller burns out on one of the nodes. I spend 2 hours waiting on parts, and 1 hour re-installing my OS, and getting the system set back up for replication. I suffer from zero down time.

Which system is more reliable? Yours, of course. A single system would win at reliability.

Which one looks more reliable (it's all about the appearance)? Mine would, because as far as my end-users are concerned the system (as a whole, all moving parts involved) did not go down.
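
The two scenarios can be put side by side in a small sketch. The hour figures come from the example above; the one-year window, the single incident per year, and the function name are assumptions added for illustration.

```python
HOURS_PER_YEAR = 24 * 365  # assumption: one incident in a one-year window


def availability(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Fraction of the period during which users could reach the system."""
    return 1.0 - downtime_hours / period_hours


# Single server: users are down for the whole repair.
single_downtime = 2 + 4   # waiting on parts + restoring from backups

# Replicated pair: the repair still takes time, but failover hides it.
pair_repair = 2 + 1       # waiting on parts + reinstalling and re-syncing
pair_downtime = 0         # user-visible downtime in the example

print(f"single server: {single_downtime} h down, "
      f"availability {availability(single_downtime):.6f}")
print(f"replicated pair: {pair_repair} h degraded, {pair_downtime} h down, "
      f"availability {availability(pair_downtime):.6f}")
```

The pair spent three hours with no redundancy at all, which the user-visible number does not show.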

Dashrender

@scottalanmiller said:

@BBigford said:

@travisdh1 I do remember that one but I thought I had read SAM say something else. Maybe I'm just crazy. I'm probably crazy. It could very well have been that though. That feature was completely misleading and criminal to even put on a feature sheet.

As long as they only called it redundant. Most IT people don't care about reliability, they just want redundancy. So why not sell it to them?

What good is redundancy if it's not reliable?

dafyre

@scottalanmiller said:

@dafyre said:

I guess I can see your point about the drives being redundant, and the data is not... but why would you extricate the two?

Because they are two different things and the term RAID is only referring to the drives. More importantly, why would someone assume the two were combined?

Why does anyone talk about redundancy instead of reliability? Who knows, but once people talk about redundancy as a proxy for reliability, bad things will happen.

I meant to say why would you extricate drives and the data... what good are the drives without data?

scottalanmiller

Using the dictionary definition of engineering redundancy, how would we define something like tightly coupled controllers?

Can they failover and keep working? Yes, they can.

Are they likely to do so? No.

Is the additional risk introduced by having two objects that can each fail, and that are likely to take their peer down with them, offset by the occasional successful failover? No.

So this is how I see it:

English Redundancy: Yes, there are two items.
Engineering Redundancy: Yes, they can failover.
Increased Reliability: No, the resulting system has become more fragile.

Even engineering redundancy can lead to fragility if we don't look at the system holistically.
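
One way to sketch that trade-off is a toy model in which each controller faults with probability p over some period, and failover succeeds with probability s when exactly one controller faults. The model, p, s, and the function names are illustrative assumptions, not measurements.

```python
def single_outage(p):
    """Outage probability with one controller: it faults, you are down."""
    return p


def pair_outage(p, s):
    """Outage probability with two coupled controllers.

    p: probability that a given controller faults in the period
    s: probability that failover succeeds when exactly one faults
    An outage happens if both fault, or if one faults and failover fails
    (for example because the fault takes the peer down too).
    """
    exactly_one_faults = 2 * p * (1 - p)
    both_fault = p * p
    return exactly_one_faults * (1 - s) + both_fault


p = 0.05  # hypothetical per-controller fault probability
for s in (0.3, 0.5, 0.9):
    print(f"failover success {s:.0%}: "
          f"single {single_outage(p):.4f}, pair {pair_outage(p, s):.4f}")
```

In this model the pair only beats the single controller when failover succeeds more than half the time. If failover is unlikely to work, the second controller adds risk instead of removing it.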

dafyre

@Dashrender said:

@scottalanmiller said:

@BBigford said:

@travisdh1 I do remember that one but I thought I had read SAM say something else. Maybe I'm just crazy. I'm probably crazy. It could very well have been that though. That feature was completely misleading and criminal to even put on a feature sheet.

As long as they only called it redundant. Most IT people don't care about reliability, they just want redundancy. So why not sell it to them?

What good is redundancy if it's not reliable?

You won't know how reliable your redundancy is until you have something fail.

scottalanmiller

@dafyre said:

I meant to say why would you extricate drives and the data... what good are the drives without data?

What if their job is just to be a cache and they can keep working just fine with a reduced drive count? Drive and data are not the same thing. While they are assumed to be associated, and certainly often are, they are different things. We can't just merge them, or we lose the ability to talk about them individually.

scottalanmiller

@Dashrender said:

What good is redundancy if it's not reliable?

It's no good. Which means it would be crazy to ever seek redundancy instead of reliability.

scottalanmiller

@dafyre said:

@scottalanmiller said:

I'm always talking about the resulting system reliability.

Right.

So you have one server that is highly reliable. It has no unplanned down time until one day, the RAID controller in the system shorts out and takes out three hard drives. You spend 2 hours down, waiting on parts, and an additional 4 hours restoring from backups.

Now I have two servers that are plain reliable and replicated with failover, etc, etc. Neither of them has any unplanned down time until one day, the RAID controller burns out on one of the nodes. I spend 2 hours waiting on parts, and 1 hour re-installing my OS, and getting the system set back up for replication. I suffer from zero down time.

Which system is more reliable? Yours, of course. A single system would win at reliability.

Which one looks more reliable (it's all about the appearance)? Mine would, because as far as my end-users are concerned the system (as a whole, all moving parts involved) did not go down.

I see. I guess it doesn't look reliable to people who know. If you say to an average person on the street "I have two servers and he has a mainframe, which is more reliable?" I bet they'd say the mainframe, because that's how people think. It's rare that people are used to the idea that "two cheap things" are better than "one good thing."

Like "I'll give you two cheap Bics in exchange for your $100 pen", most people would be like "you're nuts, this will last forever."

dafyre

@scottalanmiller said:

Using the dictionary definition of engineering redundancy, how would we define something like tightly coupled controllers?

Can they failover and keep working? Yes, they can.

Are they likely to do so? No.

Is the additional risk introduced by having two objects that can each fail, and that are likely to take their peer down with them, offset by the occasional successful failover? No.

So this is how I see it:

English Redundancy: Yes, there are two items.
Engineering Redundancy: Yes, they can failover.
Increased Reliability: No, the resulting system has become more fragile.

^ Now I understand your thought process in this.

Even engineering redundancy can lead to fragility if we don't look at the system holistically.

If the building of redundancy leads to fragility, then something is wrong, IMHO.

wirestyle22

Semantics?

scottalanmiller

@dafyre said:

If the building of redundancy leads to fragility, then something is wrong, IMHO.

That's where it is debatable. Because the goal of an engineer would be reliability. But the goal of a salesman is sales. If the customer demands redundancy and not reliability, then the cheapest path to redundancy is the right one. But put on the business empathy cap and it gets murky. Giving the customer what they want is never wrong, right?

dafyre

@scottalanmiller said:

@dafyre said:

@scottalanmiller said:

I'm always talking about the resulting system reliability.

Right.

So you have one server that is highly reliable. It has no unplanned down time until one day, the RAID controller in the system shorts out and takes out three hard drives. You spend 2 hours down, waiting on parts, and an additional 4 hours restoring from backups.

Now I have two servers that are plain reliable and replicated with failover, etc, etc. Neither of them has any unplanned down time until one day, the RAID controller burns out on one of the nodes. I spend 2 hours waiting on parts, and 1 hour re-installing my OS, and getting the system set back up for replication. I suffer from zero down time.

Which system is more reliable? Yours, of course. A single system would win at reliability.

Which one looks more reliable (it's all about the appearance)? Mine would, because as far as my end-users are concerned the system (as a whole, all moving parts involved) did not go down.

I see. I guess it doesn't look reliable to people who know. If you say to an average person on the street "I have two servers and he has a mainframe, which is more reliable?" I bet they'd say the mainframe, because that's how people think. It's rare that people are used to the idea that "two cheap things" are better than "one good thing."

Like "I'll give you two cheap Bics in exchange for your $100 pen", most people would be like "you're nuts, this will last forever."

I'd take a few packs of cheap Bics. You can have my $100 pen. It might burst and start leaking tomorrow. The chances of all 20 or 30 Bic pens leaking and bursting tomorrow are slim.
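
The pen argument is the independent-failure case. A minimal sketch, assuming each pen fails independently and using made-up failure probabilities (the figures and the function name are illustrative only):

```python
def all_fail(p_single, count):
    """Probability that every one of `count` independent items fails."""
    return p_single ** count


P_EXPENSIVE_PEN = 0.01  # hypothetical chance the $100 pen leaks tomorrow
P_CHEAP_PEN = 0.10      # hypothetical: each Bic is ten times more likely to leak

print("one expensive pen fails:", all_fail(P_EXPENSIVE_PEN, 1))
print("all 20 cheap pens fail: ", all_fail(P_CHEAP_PEN, 20))
```

This only holds because one leaking pen does not make the others leak. Components that can take each other down break the independence assumption.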

scottalanmiller

@wirestyle22 said:

Semantics?

Semantics are one of the most important things in IT. This isn't a theoretical experiment in language, this is a real problem that plagues SMB IT every day. Go on Spiceworks and the average conversation around storage is someone being hoodwinked by this very bit of semantics. They request the wrong thing, they get what they ask for and they end up paying a lot and getting something negative.

scottalanmiller

@Dashrender said:

What good is redundancy if it's not reliable?

That's what this whole thread is trying to convey: IT pros should never be asking for redundancy as a goal. The goal is always resulting reliability. Always, no exceptions.

The issue is not that people are building reliability where it isn't useful, it is that people are demanding redundancy without reason. Given that redundancy is only a means to an end (or a proximate goal rather than a real goal) no one should request it, they need reliability. If redundancy provides that reliability, no problem. If magic fairy dust does, that's fine too.

dafyre

@scottalanmiller said:

@dafyre said:

If the building of redundancy leads to fragility, then something is wrong, IMHO.

That's where it is debatable. Because the goal of an engineer would be reliability. But the goal of a salesman is sales. If the customer demands redundancy and not reliability, then the cheapest path to redundancy is the right one. But put on the business empathy cap and it gets murky. Giving the customer what they want is never wrong, right?

When the customer knows what they are getting themselves into, then yes, by all means give them what they want. Up until my experience with an almost fully virtualized infrastructure, I would rather have reliable servers.

However, after my experience with virtualized infrastructure, my mindset changed.

scottalanmiller

@dafyre said:

When the customer knows what they are getting themselves into, then yes, by all means give them what they want.

Why do we care if they know what they want, and why is it the vendor's place to judge their desires?

wirestyle22

@scottalanmiller said:

@wirestyle22 said:

Semantics?

Semantics are one of the most important things in IT. This isn't a theoretical experiment in language, this is a real problem that plagues SMB IT every day. Go on Spiceworks and the average conversation around storage is someone being hoodwinked by this very bit of semantics. They request the wrong thing, they get what they ask for and they end up paying a lot and getting something negative.

I understand everything that you guys have said here but you both agree. That is the confusing part of it for me.

Redundancy doesn't mean reliability.
Reliability doesn't mean Redundancy.

I would rather have my users complain up and down, calling me the worst SysAdmin ever yet have a better system overall. I think complication with no real reward is a huge problem in IT from what I have read and experienced.

Take my opinion with a grain of salt though. I have never made incredible claims about my knowledge. I can only speak of my experiences.

scottalanmiller

@dafyre said:

Up until my experience with an almost fully virtualized infrastructure, I would rather have reliable servers.

However, after my experience with virtualized infrastructure, my mindset changed.

It should not change. Resultant reliability is the only value.