New Infrastructure to Replace Scale Cluster

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

No, FT is about zero service interruptions, you're basically running multiple copies of the same service in either standby or load-balanced manner, and if one of a few (depends on the SLA) service instances go down, overall service availability is not harmed. With HA, the service availability is higher than with none, but it's not meant to be zero downtime in case of failure. All HA does is monitor the service and use technical means of making sure it stays up as much as possible, by failing the service over to another resource (a DR site, a standby host, a DC on Mars - doesn't matter).

FT does mean instant "transparent" failover, but it isn't without risk. If it was, Boeing wouldn't be having the issues it has today (literally today.) You are right, HA is about improving A, nothing more. It has to be better than SA (standard A) but doesn't imply FT or anything like FT.

I've said many times, a really great vendor hardware agreement can provide HA pretty cheaply in the right circumstances (in Manhattan, HPE has a response time of around 15 minutes to get hardware into datacenters, for real!) That won't work on a farm in Iowa, but it can work in Manhattan.
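To put rough numbers on why repair time matters so much, steady-state availability is commonly estimated as MTBF / (MTBF + MTTR). The sketch below is only an illustration: the MTBF figure and both repair times are made-up assumptions, not HPE data, and a 15 minute response is not automatically a 15 minute repair.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical server that fails about once every two years.
MTBF_HOURS = 2 * 365 * 24

# Assumed repair times: parts on site within the hour vs. a next-day repair.
fast_repair = availability(MTBF_HOURS, mttr_hours=1.0)
slow_repair = availability(MTBF_HOURS, mttr_hours=24.0)

print(f"1 hour repair:  {fast_repair:.5%}")
print(f"24 hour repair: {slow_repair:.5%}")
```

Same hardware, same failure rate; only the time to repair changes, and the availability rating moves with it.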

HA is achieved in any way that results in HA as the final product. FT is a specific type of thing for achieving HA (that's its purpose, better availability) but is very specific in how it tries to do it.

The key difference is...

HA is a concept or term for resulting risk or projected availability. It's just a "rating".
FT is a mechanism or family / class of mechanisms whose intended purpose is to get you to or beyond HA. But if FT doesn't get you HA, and it often doesn't, it doesn't make it not-FT, it just makes it not HA.

Or to put it another way, HA is an end and FT is a means. HA isn't FT's only goal; FT also aims for no service interruption, which HA doesn't address at all. But FT can still fail, and in some cases fail big time, and that can lead to missing HA (or even SA) even with FT.

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

All HA does is monitor the service and use technical means of making sure it stays up as much as possible, by failing the service over to another resource (a DR site, a standby host, a DC on Mars - doesn't matter).
With today's advent of distributed services, FT is what is usually being run, since it does make more sense, but HA is still used for the bulkier services, like VMs or large service nodes (e.g. OpenStack controllers).

I'm not sure on the wording here. You are using HA as if it is a "type" of protection, but it doesn't mean that in any circle. FT is a specific means to increasing availability, but HA is not.

You can determine FT by looking at the way components fail over. If they fail over in a specific, non-disruptive way, that is FT. You can only determine HA by looking at the resulting availability of a whole system (where whole system refers to the entire system being evaluated for scope.) FT scope is component, HA scope is system.

An example that @NetworkNerd and @Texkonc will remember is NEC's FT servers. They aren't HA, just FT. They use two super crappy servers that normal IT would never even accept as servers, they are so low in quality and reliability, but then they layer on a hardware FT solution. It's absolutely FT in every way, but a total joke of a product and doesn't meet HA. It's SA, but only barely. It's super silly, but they sell it because it hits the FT checkbox.
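As a rough sketch of how an FT design can still land below a single decent server, treat the pair as "at least one node up, and the FT layer itself working". Every number here is hypothetical and chosen only to illustrate the point (nothing below is measured NEC data), and the model assumes independent failures, which is optimistic.

```python
def ft_pair_availability(node: float, ft_layer: float) -> float:
    """Availability of two mirrored nodes behind an FT layer.

    The pair is up if at least one node is up AND the FT layer works.
    Assumes failures are independent of each other.
    """
    at_least_one_node = 1 - (1 - node) ** 2
    return at_least_one_node * ft_layer

# Hypothetical figures: weak nodes plus an imperfect FT layer,
# compared with one well-built standalone server.
cheap_ft_pair = ft_pair_availability(node=0.98, ft_layer=0.995)
single_good_server = 0.999

print(f"FT pair of weak nodes: {cheap_ft_pair:.4%}")
print(f"One solid server:      {single_good_server:.4%}")
```

The mechanism is still FT at the component level; the rating you get for the whole system depends on everything the mechanism is built from.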

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

But it is also incredibly true. You can't "just buy" HA, doesn't work. No product anywhere can do HA if you don't treat it properly or make the things around it support HA as well. It doesn't require "inventing anything", it's just obvious common sense.

I don't see it that way. If you want HA for a service, you pay for a solution to make it HA, that includes all the components that will protect it from various types of failure as well as provide SBA (several layers of SBA if you have the money). But in the end, it all translates to dollars. Your level of paranoia (or the list of stuff you want to protect the service from) vs your budget.

If you isolate the scope dramatically, maybe. But that's pretty weird to do with HA. Most companies want (or need) HA at a service level (they need the resulting service to be HA to the company.)

Example: Our CRM application needs to be able to be used by staff five nines of the time (okay, no one needs a CRM that much but...)

You can't really buy HA for that. You need to encompass not just the storage and app server, but the network it is attached to, the datacenter(s) it is in, the data shuffling between data centers, the generators, the people and processes around all of it, maintenance schedules, and on and on. Theoretically you can buy anything as a service, but in reality, you have to essentially outsource IT, HR, Facilities, Purchasing and maybe some others to get HA for the service as there are so many internal pieces that have to come together to make it all work. The ability to make the system HA simply requires such scope and authorization that a vendor basically can never do it unless that vendor is an IT outsourcer, in which case, we call it "doing" not "buying" because it is the same as internal staff and not productized.
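A back-of-the-envelope sketch of why scope matters: if the service needs every piece up at the same time, the availabilities multiply. The component list and all of the figures below are hypothetical, picked just to show the shape of the problem for the CRM example.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60

def series_availability(components: dict[str, float]) -> float:
    """Availability of a service that needs every component up at once."""
    return math.prod(components.values())

def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year at a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR

# Hypothetical per-component availabilities for the CRM service.
crm_stack = {
    "storage": 0.9999,
    "app server": 0.9999,
    "network": 0.9995,
    "datacenter power": 0.9999,
    "people and process": 0.999,
}

service = series_availability(crm_stack)
print(f"service availability: {service:.4%}")
print(f"expected downtime:    {downtime_minutes_per_year(service):.0f} min/year")
print(f"five nines budget:    {downtime_minutes_per_year(0.99999):.1f} min/year")
```

The service always ends up less available than its weakest piece, so making one product in the stack highly available does not make the service HA on its own.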

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

No, the solution provides protection against a hypervisor failure. You also want to protect against switch failure - buy another switch, you also want to protect against storage failure - make the storage HA. oVirt isn't a networking or storage platform, it uses storage and networking. What oVirt does do is run VMs and control hypervisors. So if a VM dies or a hypervisor dies - oVirt will provide the HA.

No, oVirt provides the mechanism, like FT. oVirt cannot guarantee HA. If you are looking at HA as a mechanism, then I can see why this would seem to make sense. But if you look at HA as "high availability", literally availability that is higher than normal, then it is clear that it can't be a mechanism and that any given mechanism can help to achieve it, but can't provide it itself.

So if a VM dies, oVirt will provide non-FT failover, yes. And by having failover you might achieve HA. And oVirt is a critical part of making that possible. But it itself isn't HA, nor does it guarantee HA. It's just a failover component that you can use to "do" HA.

dyasny

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

FT does mean instant "transparent" failover, but it isn't without risk. If it was, Boeing wouldn't be having the issues it has today (literally today.) You are right, HA is about improving A, nothing more. It has to be better than SA (standard A) but doesn't imply FT or anything like FT.

It means transparent failover for VMware, but in general it means the service can tolerate a certain amount of resources failing without having downtime.

I've said many times, a really great vendor hardware agreement can provide HA pretty cheaply in the right circumstances (in Manhattan, HPE has a response time of around 15 minutes to get hardware into datacenters, for real!) That won't work on a farm in Iowa, but it can work in Manhattan.

They won't ship anything until they've done at least a bit of diags anyway, so those 15 minutes aren't really 15 minutes. But that's also nowhere near the topic for this conversation.

HA is achieved in any way that results in HA as the final product. FT is a specific type of thing for achieving HA (that's its purpose, better availability) but is very specific in how it tries to do it.

The key difference is...

HA is a concept or term for resulting risk or projected availability. It's just a "rating".
FT is a mechanism or family / class of mechanisms whose intended purpose is to get you to or beyond HA. But if FT doesn't get you HA, and it often doesn't, it doesn't make it not-FT, it just makes it not HA.

Or to put it another way, HA is an end and FT is a means. HA isn't FT's only goal; FT also aims for no service interruption, which HA doesn't address at all. But FT can still fail, and in some cases fail big time, and that can lead to missing HA (or even SA) even with FT.

FT and HA are different techniques achieving different goals. Just like, for example, backup and raid. Yes, in certain situations, FT can replace HA, but that doesn't make one a subset of the other, it makes one a potential replacement for the other, and quite often the replacement isn't the right thing to do.

You can't really buy HA for that. You need to encompass not just the storage and app server, but the network it is attached to, the datacenter(s) it is in, the data shuffling between data centers, the generators, the people and processes around all of it, maintenance schedules, and on and on. Theoretically you can buy anything as a service, but in reality, you have to essentially outsource IT, HR, Facilities, Purchasing and maybe some others to get HA for the service as there are so many internal pieces that have to come together to make it all work. The ability to make the system HA simply requires such scope and authorization that a vendor basically can never do it unless that vendor is an IT outsourcer, in which case, we call it "doing" not "buying" because it is the same as internal staff and not productized.

Well, I call anything you pay for "buying". Unless instead of money I give a goat away for it, then it's "bartering". Now, jokes aside, you need to keep an eye on the scope for the HA and the budget for it. You can, of course, cover every possible corner case failure scenario and buy more tech to protect against it, but you can only go so far. Which means, following your own logic, that nothing you ever buy or "do" will ever give you real true veritable HA, only slightly heightened A, and no more, because there's always the ever fleeting chance of yet another potential disaster that can still cause an outage no matter what you do.

Starting small, you buy a server and install an app. The app has a watchdog to make sure it hasn't crashed and to start it again if it has. That's a very minor HA protecting against a very small scope of failure. You set up RAID 1 in that server - you just increased the HA by covering disk failure. Keep growing that by small steps, and you have a cluster with redundant everything and DR sites across the universe, everything replicating using an ansible device and whatnot. Whoops, a new big bang happens, wiping out the universe (albeit very slowly), that's an event you still haven't covered, so, following your logic - you never had HA.
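The smallest step in that chain, the watchdog, might look roughly like the sketch below. It is a toy under stated assumptions: `FakeApp` is a hypothetical stand-in for a real service, and a real watchdog would sleep between checks and talk to an actual process or init system.

```python
from typing import Callable

def watchdog(is_alive: Callable[[], bool],
             restart: Callable[[], None],
             checks: int) -> int:
    """Poll the service `checks` times and restart it whenever it is down.

    Returns the number of restarts that were triggered.
    """
    restarts = 0
    for _ in range(checks):
        if not is_alive():
            restart()
            restarts += 1
    return restarts

class FakeApp:
    """Hypothetical stand-in for a real service, for illustration only."""

    def __init__(self) -> None:
        self.running = True

    def crash(self) -> None:
        self.running = False

    def start(self) -> None:
        self.running = True

app = FakeApp()
app.crash()  # simulate the app dying
restarts = watchdog(lambda: app.running, app.start, checks=3)
print(f"restarts: {restarts}, running: {app.running}")
```

It only covers one failure mode, the app process dying. Disk, host, network and site failures each need their own layer on top.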

dyasny

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

So if a VM dies, oVirt will provide non-FT failover, yes. And by having failover you might achieve HA. And oVirt is a critical part of making that possible. But it itself isn't HA, nor does it guarantee HA. It's just a failover component that you can use to "do" HA.

It's a component you buy (well, this one is open source, but still), part of a solution that would cover other potential failure points. The solution in general is also something you can buy, in order to achieve a certain level of HA for those VMs.