Resurrecting the Past: a Project Post Mortem
-
@dafyre said:
After the move, I would estimate our servers had 95 to 99% uptime with far fewer unplanned outages. I would call a 15% increase in uptime significant.
A significant improvement over a known failed state. But nowhere near operating "at par" with having done nothing at all, right? If you hadn't done any of this clustering, SANs, extra servers, etc., you would be in far, far better shape.
So you are seeing a 15% improvement over "failure." But you are failing to look at where you are compared to SA, which is at 10,000% higher fail rates.
Do you see what's wrong here? You are comparing against something that you should not compare against. Who cares that you improved over where you were? The question is, why did so much equipment get deployed while you still haven't gotten to where you should be?
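The gap between these availability figures is easier to see as annual downtime. A quick illustrative sketch of the arithmetic (95% and 99% are the figures quoted above; the "nines" rows are added for comparison):

```python
# Convert an availability percentage into implied downtime per year.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # ~525,960 minutes

def downtime_minutes_per_year(availability_pct: float) -> float:
    """Minutes of downtime per year at the given availability."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (95.0, 99.0, 99.9, 99.99, 99.999, 99.9999):
    hours = downtime_minutes_per_year(pct) / 60
    print(f"{pct:8.4f}% -> {hours:8.2f} hours/year")
```

At 95% availability you are down roughly 18 days a year; at 99%, about 3.7 days; at six nines, about half a minute. That is the scale of the gap being argued here.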
-
@scottalanmiller said:
And that's a single SAN, no failover at all.
Right but let's compare apples to apples. Our SAN was not a single storage device. It was more akin to a storage cluster.
A two-server cluster, no SAN, no DAS, no NAS, alone should blow the doors off of six nines reliability.
Definitely agree with you there. And our VMware servers were indeed highly reliable. At the time when we were migrating everything to the SAN, I am unsure if VMware offered replication to a second server or not (I don't really remember). I think we started with ESXi 4.0.
-
@dafyre said:
@scottalanmiller said:
And that's a single SAN, no failover at all.
Right but let's compare apples to apples. Our SAN was not a single storage device. It was more akin to a storage cluster.
What was it? What made it more than one device? And if it was a cluster and it was that bad, doesn't that make things worse?
-
@scottalanmiller said:
Granted, you've improved. But not to an industry baseline rate. The improvement, if you are really only getting to two nines or even three, only appears as a win because you are approaching from a very low bar perspective. You've come back from 20% downtime down to 1%, but why is the business seeing any measurable downtime at all still?
Granted, these are not empirical mathematical calculations. They are guesstimates based on experience. As for why we still had downtime? Acts of God. Acts of drunk idiots behind the wheel. Acts of whoopsies with a backhoe. The biggest one was power outages lasting longer than our UPSes could hold the servers up. (That was a whole other issue.)
-
@dafyre said:
Definitely agree with you there. And our VMware servers were indeed highly reliable. At the time when we were migrating everything to the SAN, I am unsure if VMware offered replication to a second server or not (I don't really remember). I think we started with ESXi 4.0.
They did. But remember, a VMware server can't "be HA." They have a product called HA, but in no way does it suggest that you have HA just because you turn it on. It's a tool only.
If you had HA at one point, why did you go to the SAN and give up HA?
-
@dafyre said:
Granted, these are not empirical mathematical calculations. They are guesstimates based on experience. As for why we still had downtime? Acts of God. Acts of drunk idiots behind the wheel. Acts of whoopsies with a backhoe. The biggest one was power outages lasting longer than our UPSes could hold the servers up. (That was a whole other issue.)
Oh, we are talking about system downtime, not downtime outside of the system. Stay focused. Server uptime is measured by the server itself staying online.
Now some things, like the power going out, are part of HA. Long before you talk clusters, you should be talking UPSes and generators. Those are fundamental starting points long, long before you start modifying the IT gear, as the big downtimes come from power, Internet, etc.
Sounds like the cart driving the horse to some degree. Someone thought that SANs sounded cool and put the generator money into technology instead of the things needed to keep that technology online?
SA, in saying that the servers are well treated, assumes an enterprise datacenter with UPS, generators, quality HVAC and solid temperature control, low vibration, etc. The kind of stuff you can get, but it takes effort.
-
@scottalanmiller said:
Oh, we are talking about system downtime, not downtime outside of the system. Stay focused. Server uptime is measured by the server itself staying online.
/me concentrates really hard!
Now some things, like the power going out, are part of HA. Long before you talk clusters, you should be talking UPSes and generators. Those are fundamental starting points long, long before you start modifying the IT gear, as the big downtimes come from power, Internet, etc.
Sounds like the cart driving the horse to some degree. Someone thought that SANs sounded cool and put the generator money into technology instead of the things needed to keep that technology online?
Ha ha ha. Mighty close. However, we did tell them (the bean counters) that we would need a generator to keep things online, and they said "No, just stick with the UPSes." That move was as much of a political thing as it was a money thing.
SA, in saying that the servers are well treated, assumes an enterprise datacenter with UPS, generators, quality HVAC and solid temperature control, low vibration, etc. The kind of stuff you can get, but it takes effort.
We can check the box on UPS, quality HVAC (which was able to keep the room at 72°F even in the case of a main AC failure), temperature control, and low vibration.
*NB: I am still talking about the setup as it was when things were initially done.
-
@dafyre said:
Ha ha ha. Mighty close. However, we did tell them (the bean counters) that we would need a generator to keep things online, and they said "No, just stick with the UPSes." That move was as much of a political thing as it was a money thing.
Then, hopefully, the comeback is "if you don't want to even remotely talk about reliability, why are you spending all this money where it does no good?"
Or "what is the point of IT if arbitrary IT decisions are made without IT oversight?"
-
One thing I will mention, since you like to hear the business side of things as well... We were doing this with the goals that the administration had set before us:
- Keep live data in two locations -- check, done with the Storage Cluster
- Keep systems up as much as possible -- check, done with the Storage Cluster, VMware features, and Windows Failover Clustering
We made suggestions for having a good generator installed, but were shot down repeatedly. Many of the shoot-downs involved high-level politics that I just didn't want to get into (I hate politics. Just tell me what needs to be done and let me help get it done).
The decisions were made by the IT team, not just me. The 4 or 5 of us liked the solution that we picked, and liked it even more after we saw it in action.
-
@scottalanmiller said:
Or "what is the point of IT if arbitrary IT decisions are made without IT oversight?"
There was still a lot of that going on at the time. IT was shown $product and told to make it work with $other_product... Sometimes this was possible, and other times it was not.
Fortunately, after the fire disaster and once we got things settled in with the SAN, there were few IT decisions made without IT involvement. We made things noticeably better for the campus, so they realized that we weren't terribly stupid.
-
So, for a modern deployment, it sounds like the system is small enough that you could likely go down to two nodes with no external storage and get full failover, along with even higher reliability through a reduction of failure points and a simplification of design. Cost savings, of course, as you only need two nodes, and a performance increase from reducing bottlenecks.
Hyper-V and StarWind do this really well, without even the need for node licensing of any sort!
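The failure-point argument can be made concrete with some simple probability. Assuming (hypothetically) that each host and the SAN are individually 99.9% available, a two-host cluster hanging off a single SAN can never beat the SAN's own availability, while two replicated nodes with local storage multiply out much higher:

```python
# Sketch: why removing a shared dependency raises system availability.
# The 0.999 device availabilities below are hypothetical round numbers.
host = 0.999  # availability of one server
san = 0.999   # availability of the single shared SAN

# "Inverted pyramid": two failover hosts, but both depend on one SAN,
# so the SAN's availability multiplies (and caps) the whole system.
pyramid = san * (1 - (1 - host) ** 2)

# Two replicated nodes with local storage: up if either node is up.
replicated = 1 - (1 - host) ** 2

print(f"shared SAN: {pyramid:.6f}  replicated local: {replicated:.6f}")
```

With these numbers the shared-SAN design lands just below three nines (worse than a single plain server), while the two replicated nodes reach six nines, which is the point being made about fewer failure points.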
-
We have all EqualLogic SANs here. Mostly because it was proven that buying many cheap EQ SANs and planning on them failing was better than buying fewer, more expensive EMC (etc.) SANs. But we also have these replicating across many sites, plus AppAssure and then Azure SAN cloud replication. (Azure is a major part of our DR.)
-
I'm curious how many of the people who have a SAN actually need one. We went without one at the county with under 10 servers. The town had one, but they liked to waste IT budget and mostly used the SAN as a file server, which made no sense.
-
The Modern Deployment as I left it:
A 3 Gb fiber link to a new server room, backed by both UPS and generator. A redundant 1 Gb fiber link, a 4-node Scale HC3x cluster (servers with 64 GB of RAM each), and 7.2 TB of storage. The HP SAN is mostly retired, and the VMware servers are no longer in production.
The SQL Servers are now virtualized and clustered. I think there are only 3 physical servers left, and the current team is working to finish that off. There are now 30 VMs, as we have separated roles out heavily, so if we need to do Windows updates, it only takes out one service.
Things are much more reliable and available (since @scottalanmiller won't let me say they are more HA) than they have previously been in my 10 years at that job.
-
@thecreativeone91 said:
We have all EqualLogic SANs here. Mostly because it was proven that buying many cheap EQ SANs and planning on them failing was better than buying fewer, more expensive EMC (etc.) SANs. But we also have these replicating across many sites, plus AppAssure and then Azure SAN cloud replication. (Azure is a major part of our DR.)
If you are planning on them failing, and you have enough scale for them to make things cheaper than local storage, it can be a cost saver.
-
@thecreativeone91 said:
I'm curious how many of the people who have a SAN actually need one. We went without one at the county with under 10 servers. The town had one, but they liked to waste IT budget and mostly used the SAN as a file server, which made no sense.
I can only imagine that below the enterprise space, it has to be less than 5%. These days the scale you can reach with local storage and a small node count is just so big.
-
But now the other question.... what would we have done in 2007? Today is easy, consolidate and hyperconverge. Done. Easy peasy.
In 2007.....
-
So the big question for 2007: how many hosts would we have consolidated to with ESX back then? That's a starting point.
-
As a small shop, we started with 16 physical servers and got that number down to 6 physical servers, and could have gone lower if not for RAM constraints... So we got nearly 3:1 consolidation right off the bat.
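RAM as the consolidation ceiling can be sketched with simple arithmetic. The workload sizes and host capacity below are hypothetical round numbers (not the actual 2007 inventory), just to show how a RAM budget bounds the host count:

```python
# Estimate the minimum host count when RAM is the binding constraint.
# All figures are hypothetical examples, not the original environment.
import math

vm_ram_gb = [8, 8, 8, 8, 4, 4, 4, 4, 4, 4, 2, 2, 2, 2, 1, 1]  # 16 workloads
host_ram_gb = 16   # per-host capacity (plausible 2007-era box)
overhead_gb = 4    # hypervisor plus headroom reserved per host

usable = host_ram_gb - overhead_gb
hosts_needed = math.ceil(sum(vm_ram_gb) / usable)
print(hosts_needed)  # lower bound; real placement also needs bin packing
```

This is only a lower bound from total RAM; actual placement has to bin-pack the individual VMs, and no single VM can exceed a host's usable RAM. With these example numbers, 66 GB of workloads over 12 GB usable per host yields a floor of 6 hosts, which is the same shape of constraint described above.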
-