Greenfield HA environment choices



  • I've been thinking a lot about what choice to make if I were to go for a high availability system for VMs.

    In a past life, I worked for a medium-sized software hosting company. They were set up with a full Hyper-V failover cluster, including the standard SAN, etc. It was set up before I got there, but it worked great. The hypervisors were set up correctly (meaning they used Hyper-V Server, NOT the Hyper-V role), they used Failover Cluster Manager, etc. Doing system maintenance was a breeze, because I would fail over all the VMs to the other hosts, perform updates to the node, and then move them back. This was back when I was using Hyper-V Server 2012. As of now, a ton more features are available, including Cluster-Aware Updating and Windows Admin Center, which allows you to manage the whole lot from a web browser. Not too shabby.

    So, if you create a cluster with Hyper-V Server (not the ROLE), and plan to have only Linux VMs on the cluster, thus eliminating the whole licensing downside to MS, what are the main pros and cons of going with the MS solution vs. something else in the Linux world? I'm not leaning one way or another at this point, but I've been so engrossed in the Linux world for the past couple of years that MS Hyper-V wasn't even a thought in my mind. But now that I'm thinking about it, I can't really think of any big downsides.

    Can't wait to get some good discussion going on this.



  • @fuznutz04 said in Greenfield HA environment choices:

    I worked for a medium-sized software hosting company. They were set up with a full Hyper-V failover cluster, including the standard SAN, etc.

    That's not a good way for a software hosting company. Platform HA is a fallback for when you don't know or control the applications that you host. It's never what you want, just what you are sometimes stuck with.



  • @fuznutz04 said in Greenfield HA environment choices:

    Doing system maintenance was a breeze, because I would fail over all the VMs to the other hosts, perform updates to the node, and then move them back.

    If you need to do this, it means you have no HA for the OS. What if the OS needs patched or fails? The platform (Hyper-V) HA would be useless. It would simply protect the failed system.

    If you are looking for HA, this doesn't cut it. What you have isn't bad there, but it doesn't qualify as HA because you are only looking at one of the layers and not protecting the stack. It's a "hypervisor HA", but that's not what any business or IT means when they say HA... they mean continuity of the resultant services.

    To put it another way... you are looking at HA of the means, not of the ends.
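
    To put rough numbers on that, here is a minimal sketch, assuming the service needs every layer of the stack up at once, so the layer availabilities multiply. The layer names and percentages are made-up placeholders for illustration, not measurements of Hyper-V or anything else.

```python
from math import prod

def stack_availability(layers):
    """Availability of a service that needs every layer up (layers in series)."""
    return prod(layers.values())

# Hypothetical figures, for illustration only.
without_platform_ha = {"platform": 0.999, "guest_os": 0.999, "application": 0.999}

# Platform HA improves only the platform layer; the other layers are untouched.
with_platform_ha = {**without_platform_ha, "platform": 0.99999}

print(f"without platform HA: {stack_availability(without_platform_ha):.5f}")
print(f"with platform HA:    {stack_availability(with_platform_ha):.5f}")
```

    With these placeholder figures, even a nearly perfect platform layer cannot lift the result above the product of the unprotected guest OS and application layers.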



  • So start by looking at your workloads. Especially if you are on Linux for the VMs, why do you need platform level protection? What is missing at the application layer to create this problem?

    Don't get me wrong, I'm not saying that this fallback isn't commonly needed, but don't just assume that it is. Especially not having Windows, it's extremely rare that you would need this.



  • @scottalanmiller said in Greenfield HA environment choices:

    @fuznutz04 said in Greenfield HA environment choices:

    Doing system maintenance was a breeze, because I would fail over all the VMs to the other hosts, perform updates to the node, and then move them back.

    If you need to do this, it means you have no HA for the OS. What if the OS needs patched or fails? The platform (Hyper-V) HA would be useless. It would simply protect the failed system.

    If you are looking for HA, this doesn't cut it. What you have isn't bad there, but it doesn't qualify as HA because you are only looking at one of the layers and not protecting the stack. It's a "hypervisor HA", but that's not what any business or IT means when they say HA... they mean continuity of the resultant services.

    To put it another way... you are looking at HA of the means, not of the ends.

    That's a valid point. But let's say you are just looking at hypervisor HA (it's not feasible to move the stack to OS HA anywhere in the near future). Apply the same question.



  • @fuznutz04 said in Greenfield HA environment choices:

    not feasible to move the stack to OS HA anywhere in the near future

    You sure the stack isn't HA already? All standard stacks are, they have to be.



  • Hyper-V is a decent choice if all you want is to protect the platform and not the functionality. If you have real HA, this is useless and will often work against you and at the very least make things more costly and difficult. But if you lack HA and need a platform level alternative, Hyper-V with Starwind is a really good choice. Starwind is the only mechanism I'd consider in that situation.



  • @fuznutz04 said in Greenfield HA environment choices:

    They were set up with a full Hyper-V failover cluster, including the standard SAN

    Worth noting, if using the "standard SAN", there wasn't even HA at the platform level in that old design. True HA requires that the full stack be HA in providing services in an HA way. Platform HA would require being able to deliver the platform layer in an HA way, a SAN is the antithesis of that in both cases. You can, with extreme cost and complexity, mostly overcome the SAN problems, but never entirely and never easily.



  • KVM with Starwind or DRBD will do this, too.



  • XCP-NG is also an option. Pretty much everyone has this built in today. XCP doesn't have Starwind as an option. Hyper-V doesn't have a generally accepted native storage option, but as Starwind is native to it, it doesn't matter.



  • @scottalanmiller said in Greenfield HA environment choices:

    KVM with Starwind or DRBD will do this, too.

    I thought Starwind is NOT available on KVM hypervisors?



  • @scottalanmiller said in Greenfield HA environment choices:

    Hyper-V is a decent choice if all you want is to protect the platform and not the functionality. If you have real HA, this is useless and will often work against you and at the very least make things more costly and difficult. But if you lack HA and need a platform level alternative, Hyper-V with Starwind is a really good choice. Starwind is the only mechanism I'd consider in that situation.

    I was thinking about Starwind. I like what they do. (I like hyper-convergence in general)



  • @fuznutz04 said in Greenfield HA environment choices:

    I like what they do. (I like hyper-convergence in general)

    It's the only feasible way to get platform HA.



  • @fuznutz04 said in Greenfield HA environment choices:

    I was thinking about Starwind.

    They make the best VSAN storage layer. Insane performance.



  • @scottalanmiller said in Greenfield HA environment choices:

    @fuznutz04 said in Greenfield HA environment choices:

    I was thinking about Starwind.

    They make the best VSAN storage layer. Insane performance.

    Any idea on pricing?



  • @fuznutz04 said in Greenfield HA environment choices:

    @scottalanmiller said in Greenfield HA environment choices:

    @fuznutz04 said in Greenfield HA environment choices:

    I was thinking about Starwind.

    They make the best VSAN storage layer. Insane performance.

    Any idea on pricing?

    They always have a free option.



  • @fuznutz04 said in Greenfield HA environment choices:

    I've been thinking a lot about what choice to make if I were to go for a high availability system for VMs.

    In a past life, I worked for a medium-sized software hosting company. They were set up with a full Hyper-V failover cluster, including the standard SAN, etc. It was set up before I got there, but it worked great. The hypervisors were set up correctly (meaning they used Hyper-V Server, NOT the Hyper-V role), they used Failover Cluster Manager, etc. Doing system maintenance was a breeze, because I would fail over all the VMs to the other hosts, perform updates to the node, and then move them back. This was back when I was using Hyper-V Server 2012. As of now, a ton more features are available, including Cluster-Aware Updating and Windows Admin Center, which allows you to manage the whole lot from a web browser. Not too shabby.

    So, if you create a cluster with Hyper-V Server (not the ROLE), and plan to have only Linux VMs on the cluster, thus eliminating the whole licensing downside to MS, what are the main pros and cons of going with the MS solution vs. something else in the Linux world? I'm not leaning one way or another at this point, but I've been so engrossed in the Linux world for the past couple of years that MS Hyper-V wasn't even a thought in my mind. But now that I'm thinking about it, I can't really think of any big downsides.

    Can't wait to get some good discussion going on this.

    Greenfield environment? Totally situational IMO.

    What are you running on all of the Linux VMs? What kind of HA do you need? Hardware-level HA? VM level? Site? At what level? App level? Service level (can ping the site, but web app doesn't work)? Network level (everything is up, but nobody can access it)? Etc...? All of them?
    Would it make sense to run these services and/or apps in the cloud in a likely native HA environment with minimal effort and upfront costs?
    It totally depends on what you got going on.

    Just wondering.... can you give a specific use case other than just wanting some HA VMs? It's kinda hard to answer generally (for me).
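
    To make the difference between those levels concrete, here is a rough sketch using only the Python standard library. The function names, timeouts, and the idea of matching on expected text are my own assumptions for illustration, not anything from a particular monitoring product.

```python
import http.client
import socket
import urllib.error
import urllib.request

def port_open(host, port, timeout=2.0):
    """Network/VM-level check: can a TCP connection be opened at all?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def app_healthy(url, expected_text, timeout=2.0):
    """Service-level check: does the app answer 200 with the content we expect?"""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            return resp.status == 200 and expected_text in body
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Unreachable host, HTTP error status, protocol error, or malformed URL.
        return False
```

    A VM can pass `port_open` while `app_healthy` fails, which is exactly the "can ping the site, but web app doesn't work" case. Platform-level failover only reacts to the first kind of failure.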



  • But if you want support, which is always a good idea in production, just ping @Oksana on here.



  • @Obsolesce said in Greenfield HA environment choices:

    Would it make sense to run these services and/or apps in the cloud in a likely native HA environment

    No standard cloud offers platform HA. No major players. AWS, Azure, Vultr, Digital Ocean, Linode... none do HA.



  • @Obsolesce said in Greenfield HA environment choices:

    @fuznutz04 said in Greenfield HA environment choices:

    I've been thinking a lot about what choice to make if I were to go for a high availability system for VMs.

    In a past life, I worked for a medium-sized software hosting company. They were set up with a full Hyper-V failover cluster, including the standard SAN, etc. It was set up before I got there, but it worked great. The hypervisors were set up correctly (meaning they used Hyper-V Server, NOT the Hyper-V role), they used Failover Cluster Manager, etc. Doing system maintenance was a breeze, because I would fail over all the VMs to the other hosts, perform updates to the node, and then move them back. This was back when I was using Hyper-V Server 2012. As of now, a ton more features are available, including Cluster-Aware Updating and Windows Admin Center, which allows you to manage the whole lot from a web browser. Not too shabby.

    So, if you create a cluster with Hyper-V Server (not the ROLE), and plan to have only Linux VMs on the cluster, thus eliminating the whole licensing downside to MS, what are the main pros and cons of going with the MS solution vs. something else in the Linux world? I'm not leaning one way or another at this point, but I've been so engrossed in the Linux world for the past couple of years that MS Hyper-V wasn't even a thought in my mind. But now that I'm thinking about it, I can't really think of any big downsides.

    Can't wait to get some good discussion going on this.

    Greenfield environment? Totally situational IMO.

    What are you running on all of the Linux VMs? What kind of HA do you need? Hardware-level HA? VM level? Site? At what level? App level? Service level (can ping the site, but web app doesn't work)? Network level (everything is up, but nobody can access it)? Etc...? All of them?
    Would it make sense to run these services and/or apps in the cloud in a likely native HA environment with minimal effort and upfront costs?
    It totally depends on what you got going on.

    Just wondering.... can you give a specific use case other than just wanting some HA VMs? It's kinda hard to answer generally (for me).

    Sure thing. Basically, VM and hardware level. There are plenty of environments and workloads still in existence that are not HA on the application layer. So the ability to provide extra protection from downtime, etc, if the cost is within acceptable budgets, can be worth it.

    Example:

    • Windows VMs with software that is not designed for shared databases, shared web hosts, etc.

    • PBX - Let's say I host a PBX for my company. I want to do maintenance on the node hosting this. I want to do it during the day. I want to live migrate that VM to another node. I don't need application level failover, I just need to move it and not have downtime.

    • Any other software/workload that for one reason or another, cannot reasonably be moved to a "true HA" solution.

    We all know that the BEST scenario is to build your applications against best practices, allowing for HA type functionality. But what about those businesses who are not ready to make that investment? That's what I was thinking about.



  • @scottalanmiller said in Greenfield HA environment choices:

    But if you want support, which is always a good idea in production, just ping @Oksana on here.

    Is Max still around? I think I met him at Mangocon last year.



  • @fuznutz04 said in Greenfield HA environment choices:

    @scottalanmiller said in Greenfield HA environment choices:

    But if you want support, which is always a good idea in production, just ping @Oksana on here.

    Is Max still around? I think I met him at Mangocon last year.

    Yes, he is.

    @Stuka



  • @fuznutz04 said in Greenfield HA environment choices:

    Windows VMs with software that is not designed for shared databases, shared web hosts, etc.

    Yeah, but you said you were all Linux VMs, right? Or did I misread that?

    Pretty much no production software is this way. There is certainly a lot of bad software out there, but if the application isn't HA-able, that alone is generally a big red flag that it's not released at a level IT would consider "production" yet.



  • @fuznutz04 said in Greenfield HA environment choices:

    Example:

    • Windows VMs with software that is not designed for shared databases, shared web hosts, etc.

    This is definitely not a modern application. Not one any serious business would consider.



  • @fuznutz04 said in Greenfield HA environment choices:

    PBX - Let's say I host a PBX for my company. I want to do maintenance on the node hosting this. I want to do it during the day. I want to live migrate that VM to another node. I don't need application level failover, I just need to move it and not have downtime.

    Sure, but how do you update the PBX itself? That is easily 20x more frequent of a need than updating the host.

    Also, if you avoid Hyper-V, updating the host rarely means downtime to the VMs. We update our KVM hosts daily.

    I think this is a false need, or nearly so. It sounds good, but doesn't really play out in the real world. If you have platform updates, you have more PBX updates and need downtime if you don't have HA.

    Also worth noting, even VMware has huge risk to doing this. Production practices state that to do HA failovers at a platform level like this you have to be ready for downtime. Modern systems are way better than they used to be, but if you do this and it goes down (and I've seen Wall St. go down using VMware for this) it's 100% on you for having done something really risky while the system was in use. This is something I'd never do unless there are no calls currently, or there is actual HA to handle it.

    It's just one of those situations where... if you need the protection you need the protection, and if you don't you don't. What you are asking about is a niche that falls between all realistic scenarios. It might sound reasonable, but in reality, it's not. Not for this workload at least.



  • @fuznutz04 said in Greenfield HA environment choices:

    Any other software/workload that for one reason or another, cannot reasonably be moved to a "true HA" solution.

    Right but.... here is the rub...

    If their uptime doesn't matter, do you need this platform HA since it's already determined that they don't need this level of protection?

    If they do need this uptime, doesn't that need to be addressed?



  • @fuznutz04 so what I'm thinking that I am hearing is, and correct me if I am wrong...

    1. You perceive platform updates as carrying risk and headache.
    2. You want to do platform updates at the riskiest times (prime production) rather than waiting on a greenzone. I assume because you work during prime production and sleep during what would be the greenzone.
    3. You run fragile applications that we generally would not consider production ready and don't care about uptime protection on them in general (but don't want unnecessary downtime, either.)

    This is what I think that I am hearing, both in your descriptions of why you want this feature, as well as your perception that the other company's non-HA solution was "great". You are looking at it as a feature to make IT management easier, so the application availability isn't the factor of concern here.

    To that I would say that...

    1. Platform updates on systems like KVM are trivially easy, insanely fast, and not a problem at all. You probably only see this as a concern at all because you were running Hyper-V (or VMware) where system updates are problematic. KVM and Xen aren't like this. So if you use them, I think your entire premise for this evaporates.
    2. I would simply not do this. Even if I have HA, I wouldn't do it because prime production time is never the time to test your failover systems. Everything fails and HA is a highly risky operation even under ideal conditions and there is no vendor that is completely reliable. Doing off hours maintenance isn't just trivially easy, but it is easily scripted during production hours. Why take on cost and risk that is unnecessary? Just do an update and reboot off hours.
    3. This is real and we can't easily work around it. But the decision that these would require off hours support for safety was made at the time of acquisition.

    Also, the cost to have an MSP do this for you, if you didn't want to do it, would be far cheaper than the cost of HA. And fix the problems way more thoroughly because it would address everything, not just one piece.
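
    As a sketch of what "scripted" can mean for point 2: a scheduled job only needs to know whether it is inside the greenzone (and, for something like a PBX, whether anything is in use) before it patches and reboots. The window times, function names, and the `active_sessions` input are hypothetical placeholders, and the actual update command is deliberately left out.

```python
from datetime import datetime, time

def in_green_zone(now, start=time(1, 0), end=time(4, 0)):
    """True if `now` falls inside the maintenance window [start, end).

    Handles windows that cross midnight (for example 23:00 to 03:00).
    """
    t = now.time()
    if start <= end:
        return start <= t < end
    return t >= start or t < end

def safe_to_patch(now, active_sessions, start=time(1, 0), end=time(4, 0)):
    """Patch only inside the window and only when nothing is in use."""
    return in_green_zone(now, start, end) and active_sessions == 0
```

    A scheduler would call something like `safe_to_patch(datetime.now(), active_sessions=0)` and run the host update and reboot only when it returns `True`.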



  • @scottalanmiller said in Greenfield HA environment choices:

    @fuznutz04 so what I'm thinking that I am hearing is, and correct me if I am wrong...

    1. You perceive platform updates as carrying risk and headache.
    2. You want to do platform updates at the riskiest times (prime production) rather than waiting on a greenzone. I assume because you work during prime production and sleep during what would be the greenzone.
    3. You run fragile applications that we generally would not consider production ready and don't care about uptime protection on them in general (but don't want unnecessary downtime, either.)

    This is what I think that I am hearing, both in your descriptions of why you want this feature, as well as your perception that the other company's non-HA solution was "great". You are looking at it as a feature to make IT management easier, so the application availability isn't the factor of concern here.

    To that I would say that...

    1. Platform updates on systems like KVM are trivially easy, insanely fast, and not a problem at all. You probably only see this as a concern at all because you were running Hyper-V (or VMware) where system updates are problematic. KVM and Xen aren't like this. So if you use them, I think your entire premise for this evaporates.
    2. I would simply not do this. Even if I have HA, I wouldn't do it because prime production time is never the time to test your failover systems. Everything fails and HA is a highly risky operation even under ideal conditions and there is no vendor that is completely reliable. Doing off hours maintenance isn't just trivially easy, but it is easily scripted during production hours. Why take on cost and risk that is unnecessary? Just do an update and reboot off hours.
    3. This is real and we can't easily work around it. But the decision that these would require off hours support for safety was made at the time of acquisition.

    Also, the cost to have an MSP do this for you, if you didn't want to do it, would be far cheaper than the cost of HA. And fix the problems way more thoroughly because it would address everything, not just one piece.

    All good points here.

    1. Platform updates being risky - I agree with that piece for Windows. We've all been in situations before where a simple update takes forever and shows no progress on Windows. In all my cases with Windows, it always comes back, but compared to a KVM host update, it can sometimes be a night and day difference.
    2. Updating during normal business hours - I must have misspoken here. I never do this. I always do updates after hours as much as possible.
    3. Fragile applications - Yep, this is true in some cases. It's not what you want as the person supporting the ops piece of the puzzle, but sometimes that's what you are working with at the moment.

    Taking time to think about this more, the real want for HA stems from fear of the unknown and time to resolution in case of disaster. Example:

    • I do an update to the host and it bombs for reason X. How fast can I get my backups restored on another host?

    • A host crashes - Same question. How fast can I restore to another host?

    • My team is small, can anyone restore properly in a timely fashion?

    • Would a simple failover system (let's not call it HA, just an automatic failover to another host) be a good solution to be able to keep the VMs running until the failed host is fixed?

    All of the points brought up by you and others definitely make me pause and take a step back and really think about what the source of this post really stems from. So thanks for that. Time to think on this a bit more.
    (also, now following the Proxmox thread as well 🙂 )



  • @fuznutz04 said in Greenfield HA environment choices:

    I do an update to the host and it bombs for reason X. How fast can I get my backups restored on another host?

    Without HA, this can still be seconds or minutes. You need good procedures, but HA isn't the real protection there.



  • @fuznutz04 said in Greenfield HA environment choices:

    My team is small, can anyone restore properly in a timely fashion?

    This is a strong argument against HA. HA generally requires more experience and knowledge. It's more complex.