Resurrecting the Past: a Project Post Mortem
-
@dafyre said:
After the move, I would estimate our servers had 95 to 99% uptime with far fewer unplanned outages. I would call a 15% increase in uptime significant.
A significant improvement over a known failed state. But nowhere near operating "at par" with having done nothing at all, right? If you hadn't done any of this clustering, SANs, extra servers, etc., you would be in far, far better shape.
So you are seeing a 15% improvement over "failure." But you are failing to look at where you are compared to SA, which is at 10,000% higher fail rates.
Do you see what's wrong here? You are comparing against something that you should not compare against. Who cares that you improved over where you were? The question is, why did so much equipment get deployed while you still haven't gotten to where you should be?
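The gap between these availability figures is easier to see as annual downtime. A quick illustrative sketch of the arithmetic (95% and 99% are the figures quoted above; the "nines" rows are added for comparison):

```python
# Convert an availability percentage into implied downtime per year.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of downtime per year at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (95.0, 99.0, 99.9, 99.99, 99.999, 99.9999):
    hours = downtime_minutes_per_year(pct) / 60
    print(f"{pct:8.4f}% -> {hours:8.2f} hours/year")
```

At 95% availability you are down roughly 18 days a year; at 99%, about 3.7 days; at six nines, about half a minute. That is the scale of the gap being argued here.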
-
@scottalanmiller said:
And that's a single SAN, no failover at all.
Right but let's compare apples to apples. Our SAN was not a single storage device. It was more akin to a storage cluster.
A two-server cluster, no SAN, no DAS, no NAS, alone should blow the doors off of six nines reliability.
Definitely agree with you there. And our VMware servers were indeed highly reliable. At the time when we were migrating everything to the SAN, I am unsure if VMware offered replication to a second server or not (I don't really remember). I think we started with ESXi 4.0.
-
@dafyre said:
@scottalanmiller said:
And that's a single SAN, no failover at all.
Right but let's compare apples to apples. Our SAN was not a single storage device. It was more akin to a storage cluster.
What was it? What made it more than one device? And if it was a cluster and it was that bad, doesn't that make things worse?
-
@scottalanmiller said:
Granted, you've improved. But not to an industry baseline rate. The improvement, if you are really only getting to two nines or even three, only appears as a win because you are approaching from a very low bar perspective. You've come back from 20% downtime down to 1%, but why is the business seeing any measurable downtime at all still?
Granted, these are not empirical mathematical calculations. They are guesstimates based on experience. As for why we still had downtime? Acts of God. Acts of drunk idiots behind the wheel. Acts of whoopsies with a backhoe. The biggest one was power outages lasting longer than our UPSes could hold the servers up. (That was a whole other issue.)
-
@dafyre said:
Definitely agree with you there. And our VMware servers were indeed highly reliable. At the time when we were migrating everything to the SAN, I am unsure if VMware offered replication to a second server or not (I don't really remember). I think we started with ESXi 4.0.
They did. But remember, a VMware server can't "be HA." They have a product called HA, but in no way does it suggest that you have HA just because you turn it on. It's a tool only.
If you had HA at one point, why did you go to the SAN and give up HA?
-
@dafyre said:
Granted, these are not empirical mathematical calculations. They are guesstimates based on experience. As for why we still had downtime? Acts of God. Acts of drunk idiots behind the wheel. Acts of whoopsies with a backhoe. The biggest one was power outages lasting longer than our UPSes could hold the servers up. (That was a whole other issue.)
Oh, we are talking about system downtime, not downtime outside of the system. Stay focused. Server uptime is measured by the server itself staying online.
Now some things, like the power going out, are part of HA. Long before you talk clusters, you should be talking UPSes and generators. Those are fundamental starting points long, long before you start modifying the IT gear, as the big downtimes come from power, Internet, etc.
Sounds like the cart driving the horse to some degree. Someone thought that SANs sounded cool and put the generator money into technology instead of the things needed to keep that technology online?
SA, in saying that the servers are well treated, assumes an enterprise datacenter with UPS, generators, quality HVAC and solid temperature control, low vibration, etc. The kind of stuff you can get, but it takes effort.
-
@scottalanmiller said:
Oh, we are talking about system downtime, not downtime outside of the system. Stay focused. Server uptime is measured by the server itself staying online.
/me concentrates really hard!
Now some things, like the power going out, are part of HA. Long before you talk clusters, you should be talking UPSes and generators. Those are fundamental starting points long, long before you start modifying the IT gear, as the big downtimes come from power, Internet, etc.
Sounds like the cart driving the horse to some degree. Someone thought that SANs sounded cool and put the generator money into technology instead of the things needed to keep that technology online?
Ha ha ha. Mighty close. However, we did tell them (the bean counters) that we would need a generator to keep things online, and they said "No, just stick with the UPSes." That move was as much of a political thing as it was a money thing.
SA, in saying that the servers are well treated, assumes an enterprise datacenter with UPS, generators, quality HVAC and solid temperature control, low vibration, etc. The kind of stuff you can get, but it takes effort.
We can check the box on UPS, quality HVAC (which was able to keep the room at 72°F even in the case of a main AC failure), temperature control, and low vibration.
*NB: I am still talking about the setup as it was when things were initially done.
-
@dafyre said:
Ha ha ha. Mighty close. However, we did tell them (the bean counters) that we would need a generator to keep things online, and they said "No, just stick with the UPSes." That move was as much of a political thing as it was a money thing.
Then, hopefully, the comeback is "if you don't want to even remotely talk about reliability, why are you spending all this money where it does no good?"
Or "what is the point of IT if arbitrary IT decisions are made without IT oversight?"
-
One thing I will mention, since you like to hear the business side of things as well... We were doing this with the goals that the administration had set before us:
- Keep live data in two locations -- check, done with the Storage Cluster
- Keep systems up as much as possible -- check, done with the Storage Cluster, VMware features, and Windows Failover Clustering
We made suggestions for having a good generator installed, but were shot down repeatedly. Many of the shoot-downs involved high-level politics that I just didn't want to get into (I hate politics. Just tell me what needs to be done and let me help get it done).
The decisions were made by the IT team, not just me. The 4 or 5 of us liked the solution that we picked, and liked it even more after we saw it in action.
-
@scottalanmiller said:
Or "what is the point of IT if arbitrary IT decisions are made without IT oversight?"
There was still a lot of that going on at the time. IT was shown $product and told to make it work with $other_product... Sometimes this was possible, and other times it was not.
Fortunately, after the fire disaster and once we got things settled in with the SAN, there were few IT decisions made without IT involvement. We made things noticeably better for the campus, so they realized that we weren't terribly stupid.
-
So, for a modern deployment, it sounds like the system is small enough that you could likely go down to two nodes with no external storage and get full failover, along with even higher reliability through a reduction of failure points and a simplification of design. Cost savings, of course, as you only need two nodes, and a performance increase from reducing bottlenecks.
Hyper-V and StarWind do this really well, without even the need for node licensing of any sort!
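The failure-point argument can be made concrete with some simple probability. Assuming (hypothetically) that each host and the SAN are individually 99.9% available, a two-host cluster hanging off a single SAN can never beat the SAN's own availability, while two replicated nodes with local storage multiply out much higher:

```python
# Sketch: why removing a shared dependency raises system availability.
# The 0.999 device availabilities below are hypothetical round numbers.
host = 0.999  # availability of one server
san = 0.999   # availability of the single shared SAN

# "Inverted pyramid": two failover hosts, but both depend on one SAN,
# so the SAN's availability multiplies (and caps) the whole system.
pyramid = san * (1 - (1 - host) ** 2)

# Two replicated nodes with local storage: up if either node is up.
replicated = 1 - (1 - host) ** 2

print(f"shared SAN: {pyramid:.6f}  replicated local: {replicated:.6f}")
```

With these numbers the shared-SAN design lands just below three nines (worse than a single plain server), while the two replicated nodes reach six nines, which is the point being made about fewer failure points.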
-
We have all EqualLogic SANs here. Mostly because it was proven that buying many cheap EQ SANs and planning on them failing was better than buying fewer, more expensive EMC (etc.) SANs. But we also have these replicating across many sites, plus AppAssure and then Azure SAN cloud replication. (Azure is a major part of our DR.)
-
I'm curious how many of the people who have a SAN actually need one. We went without one at the county with under 10 servers. The town had one, but they liked to waste IT budget and mostly used the SAN as a file server, which made no sense.
-
The Modern Deployment as I left it:
A 3 Gb fiber link to a new server room, backed by both UPS and generator. A redundant 1 Gb fiber link, a 4-node Scale HC3x cluster (servers with 64 GB of RAM each), and 7.2 TB of storage. The HP SAN is mostly retired, and the VMware servers are no longer in production.
The SQL Servers are now virtualized and clustered. I think there are only 3 physical servers left, and the current team is working to finish that off. There are now 30 VMs, as we have separated roles out heavily, so if we need to do Windows updates, it only takes out one service.
Things are much more reliable and available (since @scottalanmiller won't let me say they are more HA) than they have previously been in my 10 years at that job.
-
@thecreativeone91 said:
We have all EqualLogic SANs here. Mostly because it was proven that buying many cheap EQ SANs and planning on them failing was better than buying fewer, more expensive EMC (etc.) SANs. But we also have these replicating across many sites, plus AppAssure and then Azure SAN cloud replication. (Azure is a major part of our DR.)
If you are planning on them failing, and you have enough scale for them to make things cheaper than local storage, it can be a cost saver.
-
@thecreativeone91 said:
I'm curious how many of the people who have a SAN actually need one. We went without one at the county with under 10 servers. The town had one, but they liked to waste IT budget and mostly used the SAN as a file server, which made no sense.
I can only imagine that below the enterprise space, it has to be less than 5%. These days the scale you can reach with local storage and a small node count is just so big.
-
But now the other question.... what would we have done in 2007? Today is easy, consolidate and hyperconverge. Done. Easy peasy.
In 2007.....
-
So the big question for 2007: how many hosts would we have consolidated to with ESX back then? That's a starting point.
-
As a small shop, we started with 16 physical servers and got that number down to 6 physical servers, and could have gone lower if not for RAM constraints... So we got nearly 3:1 consolidation right off the bat.
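RAM as the consolidation ceiling can be sketched with simple arithmetic. The workload sizes and host capacity below are hypothetical round numbers (not the actual 2007 inventory), just to show how a RAM budget bounds the host count:

```python
# Estimate the minimum host count when RAM is the binding constraint.
# All figures are hypothetical examples, not the original environment.
import math

vm_ram_gb = [8, 8, 8, 8, 4, 4, 4, 4, 4, 4, 2, 2, 2, 2, 1, 1]  # 16 workloads
host_ram_gb = 16   # per-host capacity (plausible 2007-era box)
overhead_gb = 4    # hypervisor plus headroom reserved per host

usable = host_ram_gb - overhead_gb
hosts_needed = math.ceil(sum(vm_ram_gb) / usable)
print(hosts_needed)  # lower bound; real placement also needs bin packing
```

This is only a lower bound from total RAM; actual placement has to bin-pack the individual VMs, and no single VM can exceed a host's usable RAM. With these example numbers, 66 GB of workloads over 12 GB usable per host yields a floor of 6 hosts, which is the same shape of constraint described above.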
-