Greenfield HA environment choices

AdamF

@scottalanmiller said in Greenfield HA environment choices:

But if you want support, which is always a good idea in production, just ping @Oksana on here.

Is Max still around? I think I met him at Mangocon last year.

scottalanmiller

@fuznutz04 said in Greenfield HA environment choices:

@scottalanmiller said in Greenfield HA environment choices:

But if you want support, which is always a good idea in production, just ping @Oksana on here.

Is Max still around? I think I met him at Mangocon last year.

Yes, he is.

@Stuka

scottalanmiller

@fuznutz04 said in Greenfield HA environment choices:

Windows VMs with software that is not designed for shared databases, shared web hosts, etc.

Yeah, but you said you were all Linux VMs, right? Or did I misread that?

Pretty much no production software is this way. There is certainly a lot of bad software out there, but if the application isn't HA-able, that alone is generally a big red flag that it's not released at a level IT would consider "production" yet.

IRJ

@fuznutz04 said in Greenfield HA environment choices:

Example:

Windows VMs with software that is not designed for shared databases, shared web hosts, etc.

This is definitely not a modern application. Not one any serious business would consider.

scottalanmiller

@fuznutz04 said in Greenfield HA environment choices:

PBX - Let's say I host a PBX for my company. I want to do maintenance on the node hosting this. I want to do it during the day. I want to live migrate that VM to another node. I don't need application level failover, I just need to move it and not have downtime.

Sure, but how do you update the PBX itself? That is easily a 20x more frequent need than updating the host.

Also, if you avoid Hyper-V, updating the host rarely means downtime to the VMs. We update our KVM hosts daily.

I think this is a false need, or nearly so. It sounds good, but doesn't really play out in the real world. If you have platform updates, you have even more PBX updates, and those need downtime if you don't have HA.

Also worth noting, even VMware carries huge risk in doing this. Production practices state that to do HA failovers at a platform level like this you have to be ready for downtime. Modern systems are way better than they used to be, but if you do this and it goes down (and I've seen Wall St. go down using VMware for this) it's 100% on you for having done something really risky while the system was in use. This is something I'd never do unless there are no calls in progress, or there is actual HA to handle it.

It's just one of those situations where... if you need the protection you need the protection, and if you don't you don't. What you are asking about is a niche that falls between all realistic scenarios. It might sound reasonable, but in reality, it's not. Not for this workload at least.

scottalanmiller

@fuznutz04 said in Greenfield HA environment choices:

Any other software/workload that for one reason or another, cannot reasonably be moved to a "true HA" solution.

Right, but here is the rub...

If their uptime doesn't matter, do you need this platform HA since it's already determined that they don't need this level of protection?

If they do need this uptime, doesn't that need to be addressed?

scottalanmiller

@fuznutz04 so what I'm thinking that I am hearing is, and correct me if I am wrong...

1. You perceive platform updates as carrying risk and headache.
2. You want to do platform updates at the riskiest times (prime production) rather than waiting on a greenzone. I assume because you work during prime production and sleep during what would be the greenzone.
3. You run fragile applications that we generally would not consider production ready and don't care about uptime protection on them in general (but don't want unnecessary downtime, either.)

This is what I think that I am hearing, both in your descriptions of why you want this feature, as well as your perception that the other company's non-HA solution was "great". You are looking at it as a feature to make IT management easier, so the application availability isn't the factor of concern here.

To that I would say that...

1. Platform updates on systems like KVM are trivially easy, insanely fast, and not a problem at all. You probably only see this as a concern at all because you were running Hyper-V (or VMware) where system updates are problematic. KVM and Xen aren't like this. So if you use them, I think your entire premise for this evaporates.
2. I would simply not do this. Even if I have HA, I wouldn't do it because prime production time is never the time to test your failover systems. Everything fails and HA is a highly risky operation even under ideal conditions and there is no vendor that is completely reliable. Off-hours maintenance isn't just trivially easy; it can be scripted during production hours to run off hours. Why take on cost and risk that is unnecessary? Just do an update and reboot off hours.
3. This is real and we can't easily work around it. But the decision that these would require off-hours support for safety was made at the time of acquisition.

Also, the cost to have an MSP do this for you, if you didn't want to do it, would be far cheaper than the cost of HA. And fix the problems way more thoroughly because it would address everything, not just one piece.

AdamF

@scottalanmiller said in Greenfield HA environment choices:

@fuznutz04 so what I'm thinking that I am hearing is, and correct me if I am wrong...

You perceive platform updates as carrying risk and headache.

You want to do platform updates at the riskiest times (prime production) rather than waiting on a greenzone. I assume because you work during prime production and sleep during what would be the greenzone.

You run fragile applications that we generally would not consider production ready and don't care about uptime protection on them in general (but don't want unnecessary downtime, either.)

This is what I think that I am hearing, both in your descriptions of why you want this feature, as well as your perception that the other company's non-HA solution was "great". You are looking at it as a feature to make IT management easier, so the application availability isn't the factor of concern here.

To that I would say that...

Platform updates on systems like KVM are trivially easy, insanely fast, and not a problem at all. You probably only see this as a concern at all because you were running Hyper-V (or VMware) where system updates are problematic. KVM and Xen aren't like this. So if you use them, I think your entire premise for this evaporates.

I would simply not do this. Even if I have HA, I wouldn't do it because prime production time is never the time to test your failover systems. Everything fails and HA is a highly risky operation even under ideal conditions and there is no vendor that is completely reliable. Off-hours maintenance isn't just trivially easy; it can be scripted during production hours to run off hours. Why take on cost and risk that is unnecessary? Just do an update and reboot off hours.

This is real and we can't easily work around it. But the decision that these would require off hours support for safety was made at the time of acquisition.

Also, the cost to have an MSP do this for you, if you didn't want to do it, would be far cheaper than the cost of HA. And fix the problems way more thoroughly because it would address everything, not just one piece.

All good points here.

Platform updates being risky - I agree with that piece for Windows. We've all been in situations where a simple Windows update takes forever and shows no progress. In all my cases with Windows, it always comes back, but compared to a KVM host update, it can sometimes be a night-and-day difference.
Updating during normal business hours - I must have misspoken here. I never do this. I always do updates after hours as much as possible.
Fragile applications - Yep, this is true in some cases. It's not what you want as the person supporting the ops piece of the puzzle, but sometimes that's what you are working with at the moment.

Taking time to think about this more, the real want for HA stems from fear of the unknown and time to resolution in case of disaster. Example:

I do an update to the host and it bombs for reason X. How fast can I get my backups restored on another host?
A host crashes - Same question. How fast can I restore to another host?
My team is small, can anyone restore properly in a timely fashion?
Would a simple failover system (let's not call it HA, just an automatic failover to another host) be a good solution to be able to keep the VMs running until the failed host is fixed?

All of the points brought up by you and others definitely make me pause, take a step back, and really think about what this post actually stems from. So thanks for that. Time to think on this a bit more.
(also, now following the Proxmox thread as well)

scottalanmiller

@fuznutz04 said in Greenfield HA environment choices:

I do an update to the host and it bombs for reason X. How fast can I get my backups restored on another host?

Without HA, this can still be seconds or minutes. You need good procedures, but HA isn't the real protection there.

scottalanmiller

@fuznutz04 said in Greenfield HA environment choices:

My team is small, can anyone restore properly in a timely fashion?

This is a strong argument against HA. HA generally requires more experience and knowledge. It's more complex.

scottalanmiller

Generally, for most use cases, just having two servers and good backups is the best option. If you have a greenzone, turn your VMs off, do your updates, turn them back on.

Updates essentially never cause issues, not on hypervisors (at least not on KVM/Xen). Putting in a lot of complexity, cost, or risk to mitigate a shark attack isn't worth it. You will focus on a false risk.
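
As a rough illustration of the greenzone routine described above (turn the VMs off, do the updates, turn them back on), here is a minimal Python sketch for a KVM host managed through libvirt's `virsh`. Nobody in the thread posted a script, so everything here is an assumption: the function names are made up for illustration, the update command (`dnf -y upgrade`) is only an example for a dnf-based host, and the five minute shutdown timeout is arbitrary. It would need to be run with enough privileges to manage the guests and update the host.

```python
import subprocess
import time
from typing import Callable, List, Sequence

Runner = Callable[[Sequence[str]], str]


def run(cmd: Sequence[str]) -> str:
    """Run a command and return its stdout; raises on a non-zero exit."""
    return subprocess.run(list(cmd), check=True, capture_output=True, text=True).stdout


def running_vms(runner: Runner = run) -> List[str]:
    """Names of the guests that are currently running on this host."""
    out = runner(["virsh", "list", "--name", "--state-running"])
    return [line.strip() for line in out.splitlines() if line.strip()]


def wait_until_off(vm: str, runner: Runner = run, timeout_s: float = 300.0,
                   poll_s: float = 5.0, sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `virsh domstate` until the guest reports 'shut off' or we give up."""
    waited = 0.0
    while waited <= timeout_s:
        if runner(["virsh", "domstate", vm]).strip() == "shut off":
            return True
        sleep(poll_s)
        waited += poll_s
    return False


def greenzone_update(update_cmd: Sequence[str], runner: Runner = run,
                     sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Shut guests down, update the host, start the same guests again."""
    vms = running_vms(runner)
    for vm in vms:
        runner(["virsh", "shutdown", vm])  # asks the guest to shut down cleanly
    stuck = [vm for vm in vms if not wait_until_off(vm, runner, sleep=sleep)]
    if stuck:
        # Stop here rather than update the host under guests that are still up.
        raise RuntimeError("guests did not shut down: " + ", ".join(stuck))
    runner(list(update_cmd))  # assumption: example update command for this host
    for vm in vms:
        runner(["virsh", "start", vm])
    return vms


if __name__ == "__main__":
    print("restarted:", greenzone_update(["dnf", "-y", "upgrade"]))
```

Scheduled from cron or a systemd timer, something like this could be set up during the day and left to run in the greenzone. If the update needs a host reboot (a new kernel, for example), the script cannot start the guests afterwards; in that case marking them with `virsh autostart` and rebooting as the last step is the more likely shape.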