Resurrecting the Past: a Project Post Mortem

dafyre

Hi All,

If you've been following the What Are you doing Thead, you know that I had a project in the past that has been discussed. Thsi project has long since been completed and I figured I'd do a post about it and let the community poke at it for a bit and see what other solutions are presented.

Backstory:

We had roughly 16 physical servers at the time (we hadn't done much with virtualization yet). 3 of those servers were earmarked for the new SIS (Student information system) that we were getting ready to implement. One of those was a SQL Server connected to a PowerVault MD 1000 (SATA / SAS conneciton, not iSCSI).

At one point, there was a fire that took out an entire building. Everything that was in the buliding was destroyed, and it was devastating to that department, to say the least. Whispers came up to the administration of a question about what if that had been the building where IT Keeps all our servers (I blame my boss for that one, lol). At this point we talked about hijacking 2 of the SIS servers and doing Virtualization with them. That plan was approved.

So the TL;DR for this project:

Requirements

Servers and Data in two locations on campus
End User files backed up
Servers backed up
We did have good backups for our physical machines
At the time, we estimated 1TB of usage of data, and planned for growth.

What would you have done in this situation? Keep in mind this was done around 2007 or 2008.

If you want more details, just let me know!

scottalanmiller

I think looking at this in two ways would be good. One from a 2007 decision making perspective (as best as we can) as a post mortem. A second from a "what would we do today" perspective to see what a modern recommendation would look like and see how things have changed and provide a real world recommendation for people searching on this.

Good to look both at the process and at the result.

scottalanmiller

@dafyre said:

We had roughly 16 physical servers at the time (we hadn't done much with virtualization yet).

Perspective Modern: Using virtualization and consolidation, down to how many physical hosts could this be collapsed today? Sixteen in 2007 would easily fit into one or two hosts today, I would assume. Was this sixteen for performance or some amount of them for other purposes (separation of duties, failover, etc.)

dafyre

@scottalanmiller said:

@dafyre said:

We had roughly 16 physical servers at the time (we hadn't done much with virtualization yet).

Perspective Modern: Using virtualization and consolidation, down to how many physical hosts could this be collapsed today? Sixteen in 2007 would easily fit into one or two hosts today, I would assume. Was this sixteen for performance or some amount of them for other purposes (separation of duties, failover, etc.)

Response You are correct. We started with 2 x VMware servers and were able to go from 16 Physical servers down to 6.

scottalanmiller

@dafyre said:

Response You are correct. We started with 2 x VMware servers and were able to go from 16 Physical servers down to 6.

Still need six? What's keeping you from two?

dafyre

@scottalanmiller said:

@dafyre said:

Response You are correct. We started with 2 x VMware servers and were able to go from 16 Physical servers down to 6.

Still need six? What's keeping you from two?

We ran out of RAM at the time, and initial testing with SQL Server made it not viable for our environment right then.

scottalanmiller

@dafyre said:

We ran out of RAM at the time, and initial testing with SQL Server made it not viable for our environment right then.

At the time, but now? Anything keeping you from two machines today? Considering you can get over a terabyte of memory per machine, is memory an issue?

scottalanmiller

How much memory did you need back then? Even in 2007 you could get an awful lot. I think we were deploying 64GB standard then but could go much larger when needed, but not by default. This was not in an SMB, in a Fortune 100, but per machine it still made sense.

scottalanmiller

I assume you have something like 10GigE available between sites?

dafyre

@scottalanmiller said:

@dafyre said:

We ran out of RAM at the time, and initial testing with SQL Server made it not viable for our environment right then.

At the time, but now? Anything keeping you from two machines today? Considering you can get over a terabyte of memory per machine, is memory an issue?

The machines we had came with 32GB RAM (these machines had already been purchased for another project, but they got repurposed before that project got off the ground)

dafyre

@scottalanmiller said:

I assume you have something like 10GigE available between sites?

No. The buildings are connected via 2 x 1gig fiber. The link utilization never really peaks out except when backups are running.

scottalanmiller

@dafyre said:

The machines we had came with 32GB RAM (these machines had already been purchased for another project, but they got repurposed before that project got off the ground)

That's another issue to be addressed. Why were machines purchased that were not needed (yet) and why are other projects being forced to make due with the rejects of failed projects?

Working with what you have is always something to be considered, but purchasing when not needed is never a good idea.

dafyre

@scottalanmiller said:

Working with what you have is always something to be considered, but purchasing when not needed is never a good idea.

^ That x 1000. These machines were originally waaaaaaaay overkill for the project that we had coming down the pipe. That project was a done deal, signed and delivered. We purchased the machines for that specific project, and then had the fire that took out a building and pushed the administration over the edge with concern about major data loss. We decided that since they were so overkill that the VMs for that project would be our first forray into Virtualization.

scottalanmiller

What kind of workloads are we supporting here? AD I assume, SQL Server. Everything is on Windows? Any special systems needing special considerations? Which ones lack application layer fault tolerance?

scottalanmiller

From your description, it sounds like the business is not averse to outages, just wants to avoid dataloss and, of course, reduced downtime is always a plus. Is small amounts of downtime in case of a disaster acceptable?

For example, FT, HA and Async Failover take you from zero downtime to moments of downtime to minutes of downtime, but the cost and complexity drops dramatically and performance increases as you go left to right.

dafyre

@scottalanmiller said:

What kind of workloads are we supporting here? AD I assume, SQL Server. Everything is on Windows? Any special systems needing special considerations? Which ones lack application layer fault tolerance?

At that time, the only thing with application-layer fault tolerance was AD. We have a Linux box that is for DHCP. There were 2 x Web Servers, 2 x Moodle Servers, 2 x Sharepoint Servers (separate pieces of the project), and the Standalone SQL Server (not virtualized... it was both virtualized and clustered later), a CLEP Testing Server, an IT Utility Server, and a few others. Several things were consolidated into the same server where it made sense... IE: our IT Utility server was also the OpenFire server, as well as our Spiceworks server (am I allowed to say that here? lol).

dafyre

@scottalanmiller said:

From your description, it sounds like the business is not averse to outages, just wants to avoid dataloss and, of course, reduced downtime is always a plus. Is small amounts of downtime in case of a disaster acceptable?

That is probably an accurate description for the business back then. However, we in IT were tired of having to do restores and such two or three times a month. The business office told us they wanted us to keep things functional as much as possible, and fit it into $budget.

When we originally started the virtualization project, everything was running on a MD1000. That is the point at which we quit looking at DAS and started looking at SAN.

dafyre

@scottalanmiller said:

For example, FT, HA and Async Failover take you from zero downtime to moments of downtime to minutes of downtime, but the cost and complexity drops dramatically and performance increases as you go left to right.

We did mostly HA after we got approval for the storage cluster (SAN, if I must call it that, lol). Our servers were hefty enough that one could absorb the VMs from the other if a host failed. Our storage cluster was truly HA -- we could lose either node and nobody would notice. We had 2 x fiber lines between the two buildings that took different paths. We did not have redundant network switches, as that was an oversight on our part, thus causing the mostly HA, lol.

Fortunately for us, there was never any unplanned down time from a failed switch.

scottalanmiller

@dafyre said:

That is probably an accurate description for the business back then. However, we in IT were tired of having to do restores and such two or three times a month.

How was that happening? That's a different issue. Even with no HA of any sort, a restore once a year should be a huge red flag. What was causing all of these restores?

scottalanmiller

@dafyre said:

Our storage cluster was truly HA -- we could lose either node and nobody would notice.

That doesn't describe HA. That describes redundancy and failover. It suggests an attempt at HA but in no way means that it was achieved. I can make something that is really fragile, is super cheap and is low availability (lower than a standard server) but fails over when I do a node test.

This redundancy instead of HA is actually what SAN vendors are famous for. Dell and HP, especially, have a track record of making entry level devices that are "fully redundant" but are LA, not SA or HA. They can demonstrate a failover anytime you want, but when they fail they rarely successfully failover and, in fact, fail far more often than a normal server.

So very careful to know how you determined that this setup is HA. Being that the nodes are in different physical locations makes that far more likely than if they were in a single chassis, but your description in no way suggests that there is HA without more details.