Cart before the Horse with RPO and RTO - Growing Core Infrastructure with the Company

NetworkNerd

I work for a growing manufacturing company. We had 1 location when I started in 2007, and I was the only one in IT. Now we have 4 members of the IT Department (all stationed at our corporate headquarters in Fort Worth) and a total of 10 sites to support (one of these is not yet fully operational). In the next 6 months, two of the existing sites will move into the site I mentioned as not yet being fully operational. But we will still have a total of 10 sites to support, as I believe we will be adding a couple more by the end of the year.

In 2012 we finally started down the path of virtualizing our servers and put in 1 ESXi host with all local storage (about 2 TB of it on 10K SAS drives). Then, we added another ESXi host (with similar specs but better processors) in 2013 to finally virtualize our ERP system. We decided to make sure we had two servers with enough processor power, RAM, and IOPS to run everything in the event that one host died. We decided to put half the VMs on each host and use Veeam for backups. At the time the second server went in, we were told that 2-4 hours would be fine as an RTO, and the RPO of the previous day's backup would be fine. Backups are taken offsite daily as well. So we left things as local storage with vSphere Essentials, which is what we still have today.

Since those two servers were put in, the number of users has grown, the number of VMs has grown, and the amount of storage in use has grown. We're at the point where neither server would have enough storage to run all VMs from the other server if one of them failed. That's a problem and creates an interesting DR situation.
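For anyone who wants to put numbers on that problem, here is a minimal sketch of the check: can the surviving host hold the failed host's VMs on its free local storage? The host names and capacity figures are hypothetical placeholders, not the real numbers from this environment, and the check looks at storage only (not RAM or CPU).

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    capacity_gb: float  # usable local datastore capacity
    used_gb: float      # storage consumed by this host's own VMs


def can_absorb(survivor: Host, failed: Host) -> bool:
    """True if the survivor's free space can hold all of the failed host's VMs."""
    free_gb = survivor.capacity_gb - survivor.used_gb
    return free_gb >= failed.used_gb


# Hypothetical figures for illustration only.
host_a = Host("esxi-a", capacity_gb=2000, used_gb=1400)
host_b = Host("esxi-b", capacity_gb=2000, used_gb=1500)

for survivor, failed in ((host_a, host_b), (host_b, host_a)):
    verdict = "fits" if can_absorb(survivor, failed) else "does NOT fit"
    print(f"{failed.name} fails: its VMs {verdict} on {survivor.name}")
```

Running the same check against only the critical VMs' storage (instead of everything on the failed host) would show how much has to be trimmed or moved before a single host can carry the core applications.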

Let's take servers at remote sites out of this discussion for right now. Some of the remote sites have / will have servers, but the core applications are hosted at HQ. The focus of this post is on those core applications which would be applicable to all sites.

The core applications in my mind hosted by HQ would be as follows (with the most critical of these being anything used by every single site):

Epicor ERP system (servicing all sites) - comprised of SQL server and 2 application servers
Exchange 2010 (servicing all sites) - left on site due to ITAR regulations
Web server (servicing all sites) - corporate access to ERP system data, contains many enhancements for production flow, used for electronic scheduling boards in the shops at almost every location
Bartender server - for label printing at many of our sites
Elastix server - PBX for most of our sites (but not all of them)
Sharepoint Foundation - contains Quality Management system data for most sites
Domain controllers (servicing all sites) - 2 of these virtual, 2 still physical
Solidworks ePDM system (servicing the site with largest revenue) - 2 servers
Unifi controller VM (servicing all sites) - 1 VM as controller for APs across all sites
File server with part programs (servicing a couple of sites) - part programs for machines stored here
Veeam server - for backup and restores

We're also getting to the point where the shops are using less and less paper. That means when a cutting operation is finished, the operator will eventually have no paper to tell him / her what to do next and must rely on an electronic scheduling board. If those don't work, we cannot make parts, and we lose money. Some cutting / bending operations could take hours before they would need to look at what comes next or transact with our ERP system, download the next part program, etc.

I started a conversation with the boss yesterday and mentioned with the push to go paperless, I was concerned about infrastructure (probably the only one who is concerned) not being resilient enough to hit our RTO. He said he thought a couple of hours of downtime was still ok, but as the downtime becomes longer, the more it will cost the company in lost manufacturing time.

But it won't just cost HQ, it will cost every single company under our corporate umbrella. Each remote site operates under its own name but has the same ownership and executive management team.

He basically said to put together what I think we need to do and how much it will cost. But I threw it back at him and said I really needed to know the cost of downtime to put together a solution they would actually approve and that would meet the needs of the business. It could be something as simple as getting another host that serves as a replication target for the other two production servers so we can flip the switch and turn on critical VMs should a host die. Or it could be something as fancy as throwing in a vSAN cluster and going up to vSphere Essentials Plus or even Standard.

When I mentioned the need to know cost of downtime, he suggested rather than talk to folks in operations and try to get them to ballpark that for us, we should talk to the execs about where we are now, how long it would take us to recover with what we have, and present possible solutions and costs to close that gap. But again, without knowing the cost of downtime, it's kind of like shooting in the dark to some extent. The more remote sites / companies are using our ERP system, the more critical it becomes, driving the cost of downtime up. And if the execs don't see dollars lost, they are less likely to shell out for much.
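One way to frame this without an exact cost of downtime is to turn the question around: for a given option, how much would an hour of downtime have to cost before the spend pays for itself? A rough sketch follows. Every number in it (solution cost, recovery times, outage frequency) is a made-up assumption for illustration, and the model is deliberately simple (linear cost per hour, no reputational or contractual penalties).

```python
def expected_annual_loss(cost_per_hour: float, rto_hours: float,
                         outages_per_year: float) -> float:
    """Expected yearly downtime cost under a simple linear model."""
    return cost_per_hour * rto_hours * outages_per_year


def break_even_cost_per_hour(annual_solution_cost: float, rto_now: float,
                             rto_after: float, outages_per_year: float) -> float:
    """Hourly downtime cost above which the solution pays for itself."""
    hours_saved = (rto_now - rto_after) * outages_per_year
    if hours_saved <= 0:
        raise ValueError("solution must reduce expected downtime")
    return annual_solution_cost / hours_saved


# Hypothetical inputs: a $7,000/year option that cuts recovery from
# 8 hours to 1 hour, with one host failure expected every two years.
threshold = break_even_cost_per_hour(
    annual_solution_cost=7000, rto_now=8, rto_after=1, outages_per_year=0.5
)
print(f"Pays off if downtime costs more than ${threshold:,.0f} per hour")
```

With a threshold like that in hand, the execs only have to answer "is an hour of downtime worth more or less than this?", which is an easier question than producing a precise figure.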

We do have a couple of spare HP servers from 2008 that might work to run ESXi, but neither has enough storage to run critical VMs should we have an issue with a host. And even if they did, would you rely on a host you bought in 2008 to run the latest version of vSphere and have it work properly to run VMs that are critical to your business should you be presented with a disaster? In my mind, if we had a host fail, I'd be messing with our older servers to see if they work first and then heading to a Fry's or MicroCenter to buy a small server to help us recover, which may or may not run ESXi. I know for a fact the boss will ask why we can't use that old equipment for something.

I'm not sure what I am looking for here in terms of a response. I think I had the right approach to get cost of downtime and to be prepared with a feasible solution based on RTO, RPO, and cost of downtime. I'd love to hear thoughts from anyone out there who wants to contribute.

JaredBusch

You definitely do not need to know cost of downtime to work on solutions. You will need to know that in order to present which option will best fit the needs of the business.

I know you are invested in hardware already, but your workload sounds like a really nice fit for a Scale cluster. I would most certainly look very hard at that before jumping up your VMware subscriptions.

I would look at the systems and see if memory and drives can feasibly be added. Something along the lines of dropping all the 10K drives for larger NL SAS drives and then getting some SSDs in RAID5 for your high speed needs.

Obviously with multiple arrays you need to plan a little more carefully on your VM locations, but that is a really good choice that should allow you to keep your existing hardware mostly intact.

scottalanmiller

@NetworkNerd said:

When I mentioned the need to know cost of downtime, he suggested rather than talk to folks in operations and try to get them to ballpark that for us, we should talk to the execs about where we are now, how long it would take us to recover with what we have, and present possible solutions and costs to close that gap.

I remember this conversation from many years ago, when you tried to get the cost of downtime when putting in the current system, they wanted you to act as the financial department and tell the financial people what they should be telling you. Sounds like in four years, they are still not on top of the CFO's office and hoping that IT will fill the gap? They are still missing the basic idea that closing the gap doesn't matter because no one knows what the gap is.

NetworkNerd

Memory can be added to either host without any trouble. The HP DL385 G7 is maxed out at 16 drives (requires all new drives to up the capacity), but the Cisco UCS 240 has only 16 drive bays in use with 8 free. If we added SSDs to the Cisco server and made a datastore just for Epicor VMs, that certainly wouldn't hurt. We'd just have to make sure we got the right type of SSDs for the workload, which is about 80/20 read / write I believe.

scottalanmiller

There is another way to look at this... IT should be able to produce RPO/RTO numbers based on the infrastructure. Do this and let other departments determine if there is a reason to not like those numbers. IT should not "worry" about this as long as the departments know the facts. If the RTO/RPO is too long, they should come back to you rather than you guessing that the numbers are not good for them. They have the numbers and know whether that is so; you are just guessing.

scottalanmiller

@NetworkNerd said:

We do have a couple of spare HP servers from 2008 that might work to run ESXi, but neither has enough storage to run critical VMs should we have an issue with a host. And even if they did, would you rely on a host you bought in 2008 to run the latest version of vSphere and have it work properly to run VMs that are critical to your business should you be presented with a disaster?

It's a DR scenario. In many cases, yes I would. You would have vSphere installed and ready to go. The server should be regularly tested. That it is from 2008 is pretty minor concern. If it lacks memory or capacity, that is its own concern. But a DR system from 2008 isn't a very big deal on its own.

NetworkNerd

@scottalanmiller said:

@NetworkNerd said:

We do have a couple of spare HP servers from 2008 that might work to run ESXi, but neither has enough storage to run critical VMs should we have an issue with a host. And even if they did, would you rely on a host you bought in 2008 to run the latest version of vSphere and have it work properly to run VMs that are critical to your business should you be presented with a disaster?

It's a DR scenario. In many cases, yes I would. You would have vSphere installed and ready to go. The server should be regularly tested. That it is from 2008 is pretty minor concern. If it lacks memory or capacity, that is its own concern. But a DR system from 2008 isn't a very big deal on its own.

You make a good point. I'd say without vSphere on them right now and confirmation that all drives and components are functional, that adds an hour or two to RTO off the bat (far more if the spares won't work) before I can even start restoring VMs.

But if we replicated the critical VMs to those hosts with Veeam even once per day (assuming we have enough storage - one of them has very little while the other has around 1 TB), turning on the replicas wouldn't take long if a host tanked.
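To compare those scenarios side by side, here is a back-of-the-envelope RTO estimate. The "hour or two" of prep comes from the post above; the data size, restore throughput, and replica failover time are hypothetical assumptions that would need to be replaced with measured values from a real test restore.

```python
def restore_hours(data_gb: float, throughput_mb_s: float) -> float:
    """Hours needed to restore data_gb at a sustained throughput in MB/s."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    return data_gb * 1024 / throughput_mb_s / 3600


def rto_hours(prep_hours: float, data_gb: float, throughput_mb_s: float) -> float:
    """Prep time (hardware checks, hypervisor install) plus restore time."""
    return prep_hours + restore_hours(data_gb, throughput_mb_s)


CRITICAL_GB = 1000   # hypothetical size of the critical VMs
RESTORE_MB_S = 100   # hypothetical sustained restore throughput

# Spare with nothing installed: about 2 hours of prep before restores start.
cold_spare = rto_hours(2.0, CRITICAL_GB, RESTORE_MB_S)
# Spare with vSphere installed, updated, and tested: restore time only.
warm_spare = rto_hours(0.0, CRITICAL_GB, RESTORE_MB_S)
# Daily replicas already on the spare: assumed 15 minutes to power them on.
replica_failover = 0.25

for label, hours in (("cold spare", cold_spare),
                     ("warm spare", warm_spare),
                     ("replica failover", replica_failover)):
    print(f"{label:>17}: ~{hours:.1f} h")
```

Note that replicating once per day shortens the RTO but leaves the RPO at up to a day, the same as the nightly backup.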

scottalanmiller

@NetworkNerd said:

You make a good point. I'd say without vSphere on them right now and confirmation that all drives and components are functional, that adds an hour or two to RTO off the bat (far more if the spares won't work) before I can even start restoring VMs.

But if we replicated the critical VMs to those hosts with Veeam even once per day (assuming we have enough storage - one of them has very little while the other has around 1 TB), turning on the replicas wouldn't take long if a host tanked.

Even without replicating the data (although if you have the capacity, that's best since there appears to be no additional cost involved) just having vSphere installed, updated and tested would save a lot.

NetworkNerd

@scottalanmiller said:

@NetworkNerd said:

When I mentioned the need to know cost of downtime, he suggested rather than talk to folks in operations and try to get them to ballpark that for us, we should talk to the execs about where we are now, how long it would take us to recover with what we have, and present possible solutions and costs to close that gap.

I remember this conversation from many years ago, when you tried to get the cost of downtime when putting in the current system, they wanted you to act as the financial department and tell the financial people what they should be telling you. Sounds like in four years, they are still not on top of the CFO's office and hoping that IT will fill the gap? They are still missing the basic idea that closing the gap doesn't matter because no one knows what the gap is.

I think someone can help quantify the gap, and I'm hoping I can find that individual and get them to help me. I seem to be the one who is most interested in gap insurance.

scottalanmiller

@NetworkNerd said:

I think someone can help quantify the gap, and I'm hoping I can find that individual and get them to help me. I seem to be the one who is most interested in gap insurance.

This should be the red flag. Why is IT driving financial decisions? It should not. It should be a partner in helping meet operational and financial goals, but it should not be taking over the role of CFO and telling the business how to run. If the CEO does not share your concern, it means that your concern is not aligned with the business.

This is what I call "AJ-ism", and it is pretty common in IT. AJ got a little famous for this because he went beyond "concern" to literally being willing to lose his job fighting the business over trying to make it do what it did not agree needed to be done. Unless your fight is for ethics or safety, IT should not be taking the lead here, at all.

If factors change, IT reminding people that RTO has expanded due to load changes, presenting new costs because things have changed or whatever is one thing. But trying to convince the owners that their financial planning isn't as good as yours and that you should be driving the financial decisions of the company is a fundamentally wrong course for IT. If this is even slightly the case, you should be in the CFO's office running finance, not in IT, because you'd be far more valuable there.

scottalanmiller

Look at it another way... your boss wants to keep you happy. But they just told you, flat out, that your concern isn't important enough to the business for them to even be willing to give you the necessary numbers to figure out another course of action. To me, it sounds like they politely shot down your project. If you come back with "zero spend" options, maybe they will like that. Probably they will like that. But it sounds like if money needs to be spent, they have told you that they don't really want to talk about it.

stacksofplates

@NetworkNerd said:

Unifi controller VM (servicing all sites) - 1 VM as controller for APs across all sites

This probably won't make a big difference at all, but could this be put on a hosted VM somewhere? That would at least alleviate restoring this VM if something happens.

Same with the Elastix server. At least in a DR scenario this would still be running.

NetworkNerd

@johnhooks said:

@NetworkNerd said:

Unifi controller VM (servicing all sites) - 1 VM as controller for APs across all sites

This probably won't make a big difference at all, but could this be put on a hosted VM somewhere? That would at least alleviate restoring this VM if something happens.

Same with the Elastix server. At least in a DR scenario this would still be running.

They certainly could. Those are definitely good suggestions that take the heat off the infrastructure at HQ. I'll have to see what pricing is like to do that. Thanks.

Jason

@johnhooks said:

@NetworkNerd said:

Unifi controller VM (servicing all sites) - 1 VM as controller for APs across all sites

This probably won't make a big difference at all, but could this be put on a hosted VM somewhere? That would at least alleviate restoring this VM if something happens.

That wouldn't make sense. At least not for the reasons you state. Remember there needs to be valid business reasons for doing this. In the Case of DR it's to provide business continuity. What business disruption will be caused if the unifi controller is down?

stacksofplates

@Jason said:

@johnhooks said:

@NetworkNerd said:

Unifi controller VM (servicing all sites) - 1 VM as controller for APs across all sites

This probably won't make a big difference at all, but could this be put on a hosted VM somewhere? That would at least alleviate restoring this VM if something happens.

That wouldn't make sense. At least not for the reasons you state. Remember there needs to be valid business reasons for doing this. In the Case of DR it's to provide business continuity. What business disruption will be caused if the unifi controller is down?

There most likely wouldn't be a disruption, but it's one less thing to worry about in a DR situation and it helps with this issue

Since those two servers were put in, the number of users has grown, the number of VMs has grown, and the amount of storage in use has grown. We're at the point where neither server would have enough storage to run all VMs from the other server if one of them failed. That's a problem and creates an interesting DR situation.

It could most likely be run on the lowest tier DO or Vultr server for ~$5 a month. It was also listed as a core application, so having that off site if possible would be better.

Jason

@johnhooks said:

@Jason said:

@johnhooks said:

@NetworkNerd said:

Unifi controller VM (servicing all sites) - 1 VM as controller for APs across all sites

This probably won't make a big difference at all, but could this be put on a hosted VM somewhere? That would at least alleviate restoring this VM if something happens.

That wouldn't make sense. At least not for the reasons you state. Remember there needs to be valid business reasons for doing this. In the Case of DR it's to provide business continuity. What business disruption will be caused if the unifi controller is down?

There most likely wouldn't be a disruption, but it's one less thing to worry about in a DR situation and it helps with this issue

You wouldn't be worrying about something that provides no business continuity in a DR situation. That would be something you can deal with much later. The focus should be on things that directly have a monetary impact on the business.

scottalanmiller

@johnhooks said:

There most likely wouldn't be a disruption, but it's one less thing to worry about in a DR situation and it helps with this issue

No disruption means no DR worry. Paying for DR facilities to eliminate IT effort that has no business impact cost is very hard to justify. It's a difficult conversation to have with the CEO "Well, we are going to pay for this extra project and monthly cost so that I don't have to do as much work." There are cases where the work reduction really does justify that, but this doesn't feel like one of those.

stacksofplates

It's also now removed off of two servers that are already over committed. With both it and elastix gone, that frees up resources for something else. While both VMs are minimal it could still help.

stacksofplates

@scottalanmiller said:

@johnhooks said:

There most likely wouldn't be a disruption, but it's one less thing to worry about in a DR situation and it helps with this issue

No disruption means no DR worry. Paying for DR facilities to eliminate IT effort that has no business impact cost is very hard to justify. It's a difficult conversation to have with the CEO "Well, we are going to pay for this extra project and monthly cost so that I don't have to do as much work." There are cases where the work reduction really does justify that, but this doesn't feel like one of those.

No worry for that, but there is still worry about the other systems that won't all fit on a single server. It may be minimal, but it's still freeing up resources. And it's only $5 a month.

scottalanmiller

@johnhooks said:

It's also now removed off of two servers that are already over committed. With both it and elastix gone, that frees up resources for something else. While both VMs are minimal it could still help.

True, that helps with capacity a little. Although VERY little, we assume.