Resurrecting the Past: a Project Post Mortem
-
What kind of workloads are we supporting here? AD I assume, SQL Server. Everything is on Windows? Any special systems needing special considerations? Which ones lack application layer fault tolerance?
-
From your description, it sounds like the business is not averse to outages, just wants to avoid data loss and, of course, reduced downtime is always a plus. Are small amounts of downtime in case of a disaster acceptable?
For example, FT, HA and Async Failover take you from zero downtime to moments of downtime to minutes of downtime, but the cost and complexity drop dramatically and performance increases as you go left to right.
-
@scottalanmiller said:
What kind of workloads are we supporting here? AD I assume, SQL Server. Everything is on Windows? Any special systems needing special considerations? Which ones lack application layer fault tolerance?
At that time, the only thing with application-layer fault tolerance was AD. We have a Linux box that is for DHCP. There were 2 x Web Servers, 2 x Moodle Servers, 2 x Sharepoint Servers (separate pieces of the project), and the Standalone SQL Server (not virtualized... it was both virtualized and clustered later), a CLEP Testing Server, an IT Utility Server, and a few others. Several things were consolidated into the same server where it made sense... IE: our IT Utility server was also the OpenFire server, as well as our Spiceworks server (am I allowed to say that here? lol).
-
@scottalanmiller said:
From your description, it sounds like the business is not averse to outages, just wants to avoid data loss and, of course, reduced downtime is always a plus. Are small amounts of downtime in case of a disaster acceptable?
That is probably an accurate description for the business back then. However, we in IT were tired of having to do restores and such two or three times a month. The business office told us they wanted us to keep things functional as much as possible, and fit it into $budget.
When we originally started the virtualization project, everything was running on a MD1000. That is the point at which we quit looking at DAS and started looking at SAN.
-
@scottalanmiller said:
For example, FT, HA and Async Failover take you from zero downtime to moments of downtime to minutes of downtime, but the cost and complexity drop dramatically and performance increases as you go left to right.
We did mostly HA after we got approval for the storage cluster (SAN, if I must call it that, lol). Our servers were hefty enough that one could absorb the VMs from the other if a host failed. Our storage cluster was truly HA -- we could lose either node and nobody would notice. We had 2 x fiber lines between the two buildings that took different paths. We did not have redundant network switches, as that was an oversight on our part, thus causing the mostly HA, lol.
Fortunately for us, there was never any unplanned down time from a failed switch.
-
@dafyre said:
That is probably an accurate description for the business back then. However, we in IT were tired of having to do restores and such two or three times a month.
How was that happening? That's a different issue. Even with no HA of any sort, a restore once a year should be a huge red flag. What was causing all of these restores?
-
@dafyre said:
Our storage cluster was truly HA -- we could lose either node and nobody would notice.
That doesn't describe HA. That describes redundancy and failover. It suggests an attempt at HA but in no way means that it was achieved. I can make something that is really fragile, is super cheap and is low availability (lower than a standard server) but fails over when I do a node test.
This redundancy instead of HA is actually what SAN vendors are famous for. Dell and HP, especially, have a track record of making entry level devices that are "fully redundant" but are LA, not SA or HA. They can demonstrate a failover anytime you want, but when they fail they rarely successfully failover and, in fact, fail far more often than a normal server.
So be very careful about how you determined that this setup is HA. The nodes being in different physical locations makes that far more likely than if they were in a single chassis, but your description in no way suggests that there is HA without more details.
-
The posts about the Dell MD SAN and the EMC VNX SAN from a few hours ago are both examples of this. In both cases someone would demonstrate how yanking one of the controllers out would cause a fault and the other controller would transparently take over. In the case of a controller being yanked, they are pretty reliable. But in the real world, that's not what a failure looks like and typically failures are caused by firmware issues, not physical removal. And physical failures and physical removal are very different things as well. You rarely get split brain during a test, or have a shock to the system.
-
@dafyre said:
When we originally started the virtualization project, everything was running on a MD1000. That is the point at which we quit looking at DAS and started looking at SAN.
The only difference between a DAS and a SAN is the addition of the switching layer. DAS is always superior to SAN, all other things being equal, because it has lower latency and fewer points of failure. There are hard limits as to how many physical hosts can attach to any given DAS, but this can easily be in the hundreds. A DAS and a SAN are the same physical devices; which one you have is determined only by how they are connected.
So when the option exists, you never choose a SAN unless a DAS physically can't do what you need, as the DAS is simply faster and safer by the laws of physics.
-
@scottalanmiller said:
That doesn't describe HA. (snip)
How would you define HA, then? I mean, is not the scenario I just described (i.e., keeping the systems up despite a major failure) a form of HA? Yes, it is failover, but isn't failover (and failback) a part of HA? Isn't that the point of having a failover cluster?
NB: This was a 2-node + witness active / active cluster.
@scottalanmiller said:
This redundancy instead of HA is actually what SAN vendors are famous for. Dell and HP, especially, have a track record of making entry level devices that are "fully redundant" but are LA, not SA or HA. They can demonstrate a failover anytime you want, but when they fail they rarely successfully failover and, in fact, fail far more often than a normal server.
Let's call it redundancy, then... our uptime went from 80% to 99% (guesstimate based on experience). We were still in a much better situation than we were before. The offices that lost money while we were doing restores were no longer suffering from anywhere near as many interruptions from our servers being down. (This was before I learned a little more business sense to calculate the cost of downtime, etc. We presented our solution to the business & financial folks, and they said do it. So we did).
@scottalanmiller said:
How was that happening? That's a different issue. Even with no HA of any sort, a restore once a year should be a huge red flag. What was causing all of these restores?
To this day, we are not sure. I think it was a faulty controller or something. After we got the SAN ^H^H^H storage cluster installed, we moved the SQL Server's database files, etc., off of the PowerVault and never looked back. (This was after the PV was out of warranty).
-
@dafyre said:
How would you define HA, then?
Pretty easy, actually. HA = High Availability. So let's start by defining Standard Availability.
What is SA? The most common accepted example of SA is to take a well treated, enterprise commodity server (in the case of systems HA like we are discussing here) and treat that as the baseline. This would generally be the big, mainline servers such as an HP DL380 or a Dell R730. These are the best selling servers in the world, and they are the standard "middle of the road" servers from the enterprise vendors. These are neither the entry level devices nor special case devices. These are the bulk of sales, the standard against which all others are measured and designed for general purpose computing.
So: SA can be defined as the availability of a system consisting of a single, properly maintained enterprise commodity server.
Therefore, what is HA? HA is then simple to describe: High availability is a system that significantly improves upon the availability of an SA system.
Being 1% better isn't enough. And it doesn't matter how it is achieved. The terms give it all away - it is about availability and nothing else. Any other aspect is misdirection and a red herring. IT departments and vendors tend to focus on the "other things" because they are simple. Is this redundant? Simple yes or no. But is it reliable? Oh, I'd have to think about that.
-
@dafyre said:
@scottalanmiller said:
I mean, is not the scenario I just described (i.e., keeping the systems up despite a major failure) a form of HA? Yes, it is failover, but isn't failover (and failback) a part of HA?
Nothing you described rules out HA, but nothing you described addresses availability at all. You are talking about technical, under-the-hood things without actually looking at the availability of the system.
So no, what you describe is not a "form of HA." HA doesn't have forms, it just is or isn't. HA isn't a thing that you can hold, it is a rate of availability higher than normal.
Failover, fault tolerance, etc. are things people often use to achieve HA, but they are tools, not results. A hammer is a tool to build a house, but buying a hammer doesn't mean you have a house. It just means you have a tool to use to make one, if you choose.
-
@dafyre said:
Isn't that the point of having a failover cluster?
No, the primary purpose of a cluster is business politics - to satisfy a checkbox, not to achieve a business goal like availability. Most vendors offer clustering in a low cost (to build) form in order to address checkboxes because checkboxes are easy.
Remember as @John-Nicholson has said: "HA is something that you do, not something that you buy."
There is no way to ever purchase HA. You can purchase failover systems. So by the nature of being able to buy them, they aren't HA themselves. This doesn't make them bad; they are just tools, not resultant availability ratings.
Seatbelts don't make a car safe, but they are a tool in improving the safety of a car. Make sense?
-
@dafyre said:
Let's call it redundancy, then... our uptime went from 80% to 99% (guesstimate based on experience).
Right, you have redundancy. Which I preach over and over again is never a goal. If someone says they want redundancy, they've lost sight of their goals. Redundancy means you have extra of something, it doesn't imply that it protects you.
Now instead of being caught up in terms like HA, redundancy, clustering, etc. we should talk about the real problems.
How did you have an availability of 80% and how did you only get up to 99%? These are both LA numbers, extremely LA numbers.
SA is generally accepted to be between four and five nines (99.99% - 99.999%) availability. You are talking about orders of magnitude less reliable systems here. Seriously, orders of magnitude.
So you have an issue here that is far, far bigger than what we are discussing and should be investigated. NTG sees over ten nines from SA setups, but we treat them REALLY well. And that's many systems over nearly two decades of running. That includes servers from the 1990s without a lot of modern engineering, cooling and redundancy. We are getting extremely high numbers, we know, but it is important to note that if your clusters aren't getting you into the six nines categories with ease, something is likely very wrong.
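To put rough numbers on the "nines" being thrown around here, a minimal Python sketch that converts an availability percentage into expected downtime per year. The function name and the 365-day year are assumptions made for illustration; nothing here comes from a measured system.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Expected downtime per year, in minutes, for an availability percentage."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100")
    return (100 - availability_pct) / 100 * MINUTES_PER_YEAR


# Walk from the 80% figure mentioned in the thread up to six nines.
for pct in (80, 99, 99.9, 99.99, 99.999, 99.9999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% -> {minutes:.2f} min/year ({minutes / 60:.2f} hours)")
```

Under these assumptions 99% works out to roughly 87.6 hours of downtime a year, while five nines is a little over five minutes, which is the gap being argued about above.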
-
@dafyre said:
We were still in a much better situation than we were before. The offices that lost money while we were doing restores were no longer suffering from anywhere near as many interruptions from our servers being down.
Granted, you've improved. But not to an industry baseline rate. The improvement, if you are really only getting to two nines or even three, only appears as a win because you are approaching from a very low bar perspective. You've come back from 20% downtime down to 1%, but why is the business seeing any measurable downtime at all still?
-
@dafyre said:
To this day, we are not sure. I think it was a faulty controller or something. After we got the SAN ^H^H^H storage cluster installed, we moved the SQL Server's database files, etc., off of the PowerVault and never looked back. (This was after the PV was out of warranty).
So using a SAN is believed to have induced a dramatic LA situation. With a SAN, it is assumed that LA is going to be the result; how could it not be? But when we talk about a SAN pushing you into LA (low availability, significantly below the availability of a single server), we are normally assuming three nines at least, 99.9% uptime.
And that's a single SAN, no failover at all.
A two server cluster, no SAN, no DAS, no NAS, alone should blow the doors off of six nines reliably.
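As a back-of-the-envelope illustration of why two plain servers can, on paper, clear six nines: the sketch below computes the availability of a set of nodes where any single surviving node is enough. It assumes the nodes fail independently and that failover is perfect and instant. Both are idealizations (the firmware faults and split-brain cases described earlier in the thread are exactly the correlated failures this ignores), so treat the result as an upper bound, not a prediction. The function name is hypothetical.

```python
def parallel_availability(node_availability_pct: float, nodes: int = 2) -> float:
    """Availability (percent) when any one surviving node keeps the service up.

    Assumes independent node failures and perfect failover, which real
    clusters do not achieve; this is an idealized upper bound.
    """
    if nodes < 1:
        raise ValueError("need at least one node")
    unavailability = 1 - node_availability_pct / 100
    # The service is down only when every node is down at the same time.
    return (1 - unavailability ** nodes) * 100


# Two nodes that are each four nines on their own.
print(f"{parallel_availability(99.99, 2):.6f}%")
```

With each node at four nines, the idealized pair comes out well above six nines; any shortfall in practice comes from the things the model leaves out.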
-
@scottalanmiller said:
@dafyre said:
@scottalanmiller said:
I mean, is not the scenario I just described (i.e., keeping the systems up despite a major failure) a form of HA? Yes, it is failover, but isn't failover (and failback) a part of HA?
Nothing you described rules out HA, but nothing you described addresses availability at all. You are talking about technical, under-the-hood things without actually looking at the availability of the system.
So no, what you describe is not a "form of HA." HA doesn't have forms, it just is or isn't. HA isn't a thing that you can hold, it is a rate of availability higher than normal.
Failover, fault tolerance, etc. are things people often use to achieve HA, but they are tools, not results. A hammer is a tool to build a house, but buying a hammer doesn't mean you have a house. It just means you have a tool to use to make one, if you choose.
I'll agree with what you said above.
Okay. We will assume that the SA for my server was 80% due to the problems with the DAS array (yes, it was really down that often).
By migrating the SQL Server stuff to the storage cluster, my reliability went UP instead of down (I understand from past discussions with you that reliability will usually go down when things are not properly planned / implemented).
After the move, I would estimate our server had 95 to 99% uptime with far fewer unplanned outages. I would call a 15% increase in uptime significant.
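Since the uptime figures in this exchange are estimates, here is a minimal sketch of how a measured figure could be derived from an outage log instead. The outage durations below are made up purely for illustration and are not taken from the systems discussed; the function name is also hypothetical.

```python
def measured_availability(outage_hours: list[float], period_hours: float) -> float:
    """Availability (percent) over a period, given each outage's length in hours."""
    if period_hours <= 0:
        raise ValueError("period must be positive")
    downtime = sum(outage_hours)
    if downtime > period_hours:
        raise ValueError("outages exceed the measurement period")
    return (1 - downtime / period_hours) * 100


# Hypothetical 30-day month with three restores of 4, 6.5 and 3 hours.
hypothetical_outages = [4.0, 6.5, 3.0]
print(round(measured_availability(hypothetical_outages, 30 * 24), 3))
```

Tracking outages this way makes it possible to compare against the industry figures quoted in the thread rather than against a remembered impression.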
-
@dafyre said:
Okay. We will assume that the SA for my server was 80% due to the problems with the DAS array (yes, it was really down that often).
SA is not a rate "for you"; it is an industry rate. Roughly five nines. Your setup was the stock, well-established LA design. So that you got LA rates out of it isn't surprising. But that it was below 99% is pretty surprising.
-
@dafyre said:
By migrating the SQL Server stuff to the storage cluster, my reliability went UP instead of down (I understand from past discussions with you that reliability will usually go down when things are not properly planned / implemented).
Of course it went up. You went from a system that was obviously broken to one that at least was working, right?
But you've still not come close to SA, let alone HA. That you came back from the brink of disaster doesn't imply that you are in good shape.
In theory, you could run to the store, buy a nice server for $25K (HP, Dell, Oracle, IBM, etc.), move everything to it, throw out every server you have today, the SAN, the cluster, everything. And have nothing but one single server and a backup system (not a failover, just something to take backups) and shoot from 99% uptime to 99.999% uptime.
And that's just the 1000x improvement to get to SA. Imagine if we got you to HA!!
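The 1000x figure can be checked by comparing downtime fractions rather than uptime percentages. A small sketch follows, using exact fractions to avoid floating-point noise; the function name is hypothetical and the inputs are the percentages quoted in the thread.

```python
from fractions import Fraction


def downtime_ratio(before_pct: str, after_pct: str) -> Fraction:
    """How many times less downtime 'after' has than 'before'.

    Takes percentages as strings so they convert to exact fractions.
    """
    before_down = 100 - Fraction(before_pct)
    after_down = 100 - Fraction(after_pct)
    if after_down == 0:
        raise ZeroDivisionError("100% availability has no downtime to compare")
    return before_down / after_down


print(downtime_ratio("99", "99.999"))  # 1000
print(downtime_ratio("80", "99"))      # 20
```

Seen this way, going from 80% to 99% cuts downtime by a factor of 20, while going from 99% to five nines cuts it by a further factor of 1000.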
-
@dafyre said:
After the move, I would estimate our server had 95 to 99% uptime with far fewer unplanned outages. I would call a 15% increase in uptime significant.
A significant improvement over a known failed state. But nowhere near operating "at par" with having done nothing at all, right? If you didn't do any of this clustering, SANs, extra servers, etc, you would be in far, far better shape.
So you are seeing a 15% improvement over "failure." But you are failing to look at where you are compared to SA: your failure rate is something like 10,000% higher.
Do you see what's wrong here? You are comparing against something that you should not compare against. Who cares that you improved over where you were? The question is, why did so much equipment get deployed and you still haven't gotten to where you should be?