New Infrastructure to Replace Scale Cluster

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

you're obviously not inventing the term HA, but you are inventing this ridiculous saying about HA being what you do and not what you buy.

No, I've clearly quoted the source on that every time. It's @StorageNinja

But it is also incredibly true. You can't "just buy" HA, doesn't work. No product anywhere can do HA if you don't treat it properly or make the things around it support HA as well. It doesn't require "inventing anything", it's just obvious common sense.

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

Everyone wants to sell solutions, not products, to the point where these solutions in turn become productized. And an HA solution can be bought.

Sure, but that is 1) almost always untrue, very few vendors actually do anything to achieve HA 2) when they do, they are becoming the IT department and building, not buying, HA.

Basically what you are doing is agreeing with John, but playing a services semantics game to look at it from a CEO's perspective that he can "buy an IT department that will be tasked with implementing HA". So if you consider your staffing, hiring, and business decisions to be "something you can buy like a product", then sure. But all you've done is define "anything you do" as something purchasable.

DustinB3403

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

But all you've done is define "anything you do" as something purchasable.

To be the devil's advocate

So you can pay someone to breathe for you? To flush the toilet for you? To wipe for you?

JaredBusch

@DustinB3403 said in New Infrastructure to Replace Scale Cluster:

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

But all you've done is define "anything you do" as something purchasable.

To be the devil's advocate

So you can pay someone to breathe for you? To flush the toilet for you? To wipe for you?

Yes, yes, and yes.

DustinB3403

@JaredBusch said in New Infrastructure to Replace Scale Cluster:

@DustinB3403 said in New Infrastructure to Replace Scale Cluster:

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

But all you've done is define "anything you do" as something purchasable.

To be the devil's advocate

So you can pay someone to breathe for you? To flush the toilet for you? To wipe for you?

Yes, yes, and yes.

Well now I know where my paycheck is going to!

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

Out of the box with oVirt, this provides us with the ability to make VMs highly available, just configure the hosts, and mark the VMs as HA, nothing else. Assuming networking was done properly, and we have switch redundancy, and hoping the building has a generator and the storage is also reliable is nice, but out of scope for this particular scenario. All the OP wanted was VM HA, where if a VM dies or the host carrying it stops being able to host it, it gets safely started elsewhere. It isn't hard to grasp the concept, really.

See, this is where it all falls apart. Your example proves our point. In your example with the IPOD, you aren't even trying to make HA. oVirt's HA option is actually terrible here because it tricks the humans into thinking there is protection where there is not and often makes people (where makes = allows their brains to accept) introduce more risk, lowering their availability below standard, by seeing the term HA applied to one isolated layer of the stack and ignoring the increased risk of the overall stack.

It's not just an example of how HA fails, it is THE example that we've been talking about for a decade. It's the "buy the book" worst possible setup where you add lots of hardware, add "HA products", and the result is a system that is slower and more fragile than if you hadn't done it at all. It's the exact scenario we did the risk analysis on a few years ago, the one that has been beaten to death.

If you feel the math is wrong, go after that. But you are just repeating the identical arguments that people have always made and doing the same "not looking where the problem is", instead looking at the term HA and ignoring that to turn it on you had to create a huge amount of risk that wasn't there originally. It's all the standard marketing trick.

There is no pissing match, these are tried and true, obvious and well known problems with HA marketing. It's an example that trivially proves the point (and has been shown to, SO many times.) And the only thing going on is you are making a fresh argument for something that was long ago shown to be LA (low availability) and acting like we don't all already know this. Your statements about it suggest that you either haven't read about it and haven't examined the risk in whole, or you are knowingly ignoring the body of work on it and thinking that if you pretend the evidence hasn't been presented that rehashing it will cause us to forget where the risk is.

Arguing the point in this way doesn't make sense, because you aren't stating where the math is wrong. You are just ignoring that the math has been presented over and over again and no one in a decade has ever ended up disagreeing with the evidence. That is not how to present a logical argument.

scottalanmiller

@DustinB3403 said in New Infrastructure to Replace Scale Cluster:

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

But all you've done is define "anything you do" as something purchasable.

To be the devil's advocate

So you can pay someone to breathe for you? To flush the toilet for you? To wipe for you?

And you can then call it something "you buy" rather than "something you do".

scottalanmiller

@DustinB3403 said in New Infrastructure to Replace Scale Cluster:

@JaredBusch said in New Infrastructure to Replace Scale Cluster:

@DustinB3403 said in New Infrastructure to Replace Scale Cluster:

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

But all you've done is define "anything you do" as something purchasable.

To be the devil's advocate

So you can pay someone to breathe for you? To flush the toilet for you? To wipe for you?

Yes, yes, and yes.

Well now I know where my paycheck is going to!

To be used as toilet paper?

DustinB3403

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

@DustinB3403 said in New Infrastructure to Replace Scale Cluster:

@JaredBusch said in New Infrastructure to Replace Scale Cluster:

@DustinB3403 said in New Infrastructure to Replace Scale Cluster:

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

But all you've done is define "anything you do" as something purchasable.

To be the devil's advocate

So you can pay someone to breathe for you? To flush the toilet for you? To wipe for you?

Yes, yes, and yes.

Well now I know where my paycheck is going to!

To be used as toilet paper?

To pay people to do things that I do myself!

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

I frankly don't remember the last time I actually bought anything from a vendor. People usually go to an integrator, and pay for building a solution for them.

Integrator is industry speak for the vendor advocate. When anyone in IT says vendor, they mean integrator. It's just accepted that they are the channel arm for their vendors and to the IT side they are one and the same. Both are vendor advocates, both are sales people, one just repeats the marketing of the other.

Dashrender

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

@DustinB3403 said in New Infrastructure to Replace Scale Cluster:

@JaredBusch said in New Infrastructure to Replace Scale Cluster:

@DustinB3403 said in New Infrastructure to Replace Scale Cluster:

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

But all you've done is define "anything you do" as something purchasable.

To be the devil's advocate

So you can pay someone to breathe for you? To flush the toilet for you? To wipe for you?

Yes, yes, and yes.

Well now I know where my paycheck is going to!

To be used as toilet paper?

LOL - nice!!
I see what you did there.

scottalanmiller

The marketing trick being used here is "shifted risk." It's a "sleight of hand" way to make something look like HA and it is the most common approach used to sell "HA equipment" that isn't actually HA. It's popular because it provides an affordable setup, at great margins, and high cost, but not so high as to break the customer's bank. But it costs quite a bit more than a true HA setup, and so is popular as a way for vendors (and integrators) to raise the costs without raising them beyond acceptable limits.

The trick is that they identify the risks of the initial system. They call it "the app server" to begin the trick. In a single server setup, this isn't correct, the single server is the "entire physical system", you can't isolate one part of it like that, but they do and hence the trick begins.

Next they say "how do we eliminate this risk"? Answer: make it redundant. Another sleight of hand: redundant means "an extra thing", not "it is fault tolerant", but the average person thinks that they must mean the latter and hears that in their head regardless of what is actually said. Of course this is wrong, because the risk is of the entire system, not just one piece, but they only offered to make one piece redundant. In theory, everyone should know that the thing is a trick at this point, but people always want to give the benefit of the doubt because hearing that "HA can be bought" triggers an emotional reaction and makes us hope that it is really so easy and automatic that it requires nothing from us.

Next the big risks, the storage and other system components are shifted out of the single box and put somewhere else with a "no need to look over there, it's magic" attitude and people happily agree that since storage is hard, they won't look any further. Every integrator pulls this trick, this is where the money is. So people assume that since everyone does it, it must be right.

Then the integrator adds in more risk by needing switches. But they, again, simply raise the cost by saying it needs to be redundant and since it is redundant, ignore that it carries risks too.

The end result is the integrator gets to sell not just the two servers the customer didn't need (because if they bought this, they clearly didn't need HA at all), but also a third server, and two extra switches, and a lot of set up man power. The vendor piece of the equation is thrilled because they easily double (or more) their hardware sales. And the integrator piece is thrilled because they didn't just double their margins but they also use the unneeded complexity to push for a lot of integrator hours and ongoing support because they made something that was best left simple, into something really complex.

And all of this by somehow convincing people that "risk isn't additive", and it is amazing that people can be tricked into that. The storage server, at the end of the day, carries literally all of the risk of the original server and isn't removed from the risk pool in any way. If we had kept our eye on the storage component of the original setup instead of being tricked by being directed to the application piece, it all becomes really obvious, really quickly. That's how the card trick is played - look at this motion while I hide the card you were trying to watch.

Then, once we take our eyes off of the "unaddressed initial risk", all of the other risk, the application servers, the switches, the cabling... is all "extra risk layered on top of the unaddressed initial risk." And math is math. More risk is riskier than less risk. It's literally that simple.
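A rough sketch of that arithmetic, for anyone who wants to plug in their own figures. Every availability number below is an assumption picked for illustration, not a measurement of any real product, and the model is generous to the IPOD: it assumes failures are independent and that failover between redundant parts is perfect.

```python
# Illustrative only: every availability figure here is an assumed number.

def series(*parts):
    """All parts are required, so their availabilities multiply."""
    result = 1.0
    for a in parts:
        result *= a
    return result

def parallel(*parts):
    """Redundant parts: the group is down only if every member is down."""
    all_down = 1.0
    for a in parts:
        all_down *= (1.0 - a)
    return 1.0 - all_down

SERVER = 0.9995   # assumed availability of one decent server
SWITCH = 0.9999   # assumed availability of one switch
STORAGE = 0.9995  # assumed: the storage box is just another server

# Original setup: one server holding compute and storage together.
single_server = SERVER

# IPOD: two redundant app servers, two redundant switches,
# and one shared storage server that everything depends on.
ipod = series(parallel(SERVER, SERVER), parallel(SWITCH, SWITCH), STORAGE)

print(f"single server: {single_server:.6f}")
print(f"IPOD stack:    {ipod:.6f}")
```

Because the storage server sits in series with everything else, the stack can never come out better than the storage server alone, no matter how much redundancy is layered on top of it.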

scottalanmiller

This IPOD setup, and the ~~vendors~~ integrators that sell it, we called "the standard scam of SMB IT" a year ago.

Youtube Video

dyasny

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

That might be a way to achieve HA, but it isn't the definition of it. The definition of it comes from the level of availability, nothing to do with the mechanisms or attempts at it. You are correct that people often confuse HA (a level of availability) with FT (a physical type of protection whose purpose is to help provide HA, or at least "better A".)

No, FT is about zero service interruptions, you're basically running multiple copies of the same service in either standby or load-balanced manner, and if one of a few (depends on the SLA) service instances go down, overall service availability is not harmed. With HA, the service availability is higher than with none, but it's not meant to be zero downtime in case of failure. All HA does is monitor the service and use technical means of making sure it stays up as much as possible, by failing the service over to another resource (a DR site, a standby host, a DC on Mars - doesn't matter).

With today's advent of distributed services, FT is what is usually being run, since it does make more sense, but HA is still used for the bulkier services, like VMs or large service nodes (e.g. openstack controllers)

But it is also incredibly true. You can't "just buy" HA, doesn't work. No product anywhere can do HA if you don't treat it properly or make the things around it support HA as well. It doesn't require "inventing anything", it's just obvious common sense.

I don't see it that way. If you want HA for a service, you pay for a solution to make it HA, that includes all the components that will protect it from various types of failure as well as provide SBA (several layers of SBA if you have the money). But in the end, it all translates to dollars. Your level of paranoia (or the list of stuff you want to protect the service from) vs your budget.

Sure, but that is 1) almost always untrue, very few vendors actually do anything to achieve HA 2) when they do, they are becoming the IT department and building, not buying, HA.

All the vendors dealing with IT infrastructure have tons of HA oriented solutions. From hardware vendors selling RAID controllers, power management devices and redundant PSUs, to software vendors building support for HA into their products. I find it hard to think of any IT product I've used recently that didn't have HA or FT built in.
That doesn't make any sense.

See, this is where it all falls apart. Your example proves our point. In your example with the IPOD, you aren't even trying to make HA. oVirt's HA option is actually terrible here because it tricks the humans into thinking there is protection where there is not and often makes people (where makes = allows their brains to accept) introduce more risk, lowering their availability below standard, by seeing the term HA applied to one isolated layer of the stack and ignoring the increased risk of the overall stack.

No, the solution provides protection against a hypervisor failure. You also want to protect against switch failure - buy another switch, you also want to protect against storage failure - make the storage HA. oVirt isn't a networking or storage platform, it uses storage and networking. What oVirt does do is run VMs and control hypervisors. So if a VM dies or a hypervisor dies - oVirt will provide the HA. If all you have are three hosts, what I advised on isn't the safest solution, but it is the easiest to build and run, as well as expand upon later, when budget is available for covering the SPOFs in the setup. It will also provide better performance than using gluster, and there might be more advantages in terms of disk space availability, depending on the hardware the OP has.

I'm not arguing it is safer this way, I'm saying it is easier to build and reasonably safe if backups are done, especially if there is more budget to come in later in the game. Something to get started with.

The user has to be aware of the SPOFs in the system, I'm not saying there are none. The user also has to work at eliminating them, but again, you're applying the logic of a large company with large resources to a tiny little shop with 3 machines. At this level, they might as well just run everything locally on 3 disparate hosts and be done, that's even easier. The point of oVirt here is to start a proper virtual DC from some small set of available boxes and grow it into a proper solution. Nothing involving 3 hosts and a switch can provide real HA in any case.

So please stop running away into depths that are irrelevant to this particular setup. We already know you're an IT bigshot, there's no need to keep showing off.

Integrator is industry speak for the vendor advocate. When anyone in IT says vendor, they mean integrator. It's just accepted that they are the channel arm for their vendors and to the IT side they are one and the same. Both are vendor advocates, both are sales people, one just repeats the marketing of the other.

Sure, but that integrator will do everything for you - networking, HVAC, servers, storage, software - all designed and built as per your requirements, as one single product - your DC.

The marketing trick ...

So much text with nothing actually said. That's actually a marketing trick. Or a politician's trick, whatever.

When you deal with an integrator/vendor/consultant/etc you go over the proposed solution, and it is up to you to see where you're being oversold on stuff and where the proposed parts of the system are actually needed. If you haven't done your due diligence - it's your fault and you should leave the profession, instead of blaming "marketing tricks". All this talk of marketing tricks is basically an attempt at shifting the blame for not having done proper due diligence when signing the purchase contract, nothing more.

When I did freelance DC design, I always pointed out the specific points which could be an SPOF and agreed with the customer on whether they want and can afford to address them or not, instead of just trying to milk them for money. That usually got my overall solution prices to be much lower than what large vendors and integrators offered. I ended up making less money, but having loyal customers who trusted me and kept coming back. If I just came in and started pushing a small enterprise into building wall street level solutions, I'd have been kicked out of the building. But when the solution was up, with known, budget related issues, we always had a plan for the next few years to address these issues, as well as a backup plan to protect the really important aspects of the business. In a few years, gradually, the issues got addressed and the setup grew more reliable, as budget permitted. Had I come in demanding everything was covered from the start, there wouldn't have been any setup to grow and improve in the first place.

Again, I'm not saying there is perfect HA in the solution, I'm not saying it's totally reliable. I'm saying it's a start with some protection against known failures, and a set of SPOFs to be aware of and address in the future. I'm not trying to sell any false promises. For a small business with 3 servers, this is as much as they can hope for anyway.

If you eliminate the storage SPOF, you still have a network SPOF in place, and you're running a badly performing storage system, driving the amount of network traffic up as well as tripling the storage consumption. And you're adding a lot more software into the mix, software that can break, have bugs etc etc etc. All of that without actually having the perfect HA - there's no redundant networking, no redundant HVAC and no DR site. OMG, we're all gonna die!

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

No, FT is about zero service interruptions, you're basically running multiple copies of the same service in either standby or load-balanced manner, and if one of a few (depends on the SLA) service instances go down, overall service availability is not harmed. With HA, the service availability is higher than with none, but it's not meant to be zero downtime in case of failure. All HA does is monitor the service and use technical means of making sure it stays up as much as possible, by failing the service over to another resource (a DR site, a standby host, a DC on Mars - doesn't matter).

FT does mean instant "transparent" failover, but it isn't without risk. If it was, Boeing wouldn't be having the issues it has today (literally today.) You are right, HA is about improving A, nothing more. It has to be better than SA (standard A) but doesn't imply FT or anything like FT.

I've said many times, a really great vendor hardware agreement can provide HA pretty cheaply in the right circumstances (in Manhattan, HPE has a response time of around 15 minutes to get hardware into datacenters, for real!) That won't work on a farm in Iowa, but it can work in Manhattan.

HA is achieved in any way that results in HA as the final product. FT is a specific type of thing for achieving HA (that's its purpose, better availability) but is very specific in how it tries to do it.

The key difference is...

HA is a concept or term for resulting risk or projected availability. It's just a "rating".
FT is a mechanism or family / class of mechanisms whose intended purpose is to get you to or beyond HA. But if FT doesn't get you HA, and it often doesn't, it doesn't make it not-FT, it just makes it not HA.

Or to put it another way, HA is an end and FT is a means. FT's only goal isn't HA, it is also no service interruption which is not addressed in HA at all. But FT can still fail, and in some cases fail big time, and that can lead to missing HA (or even SA) even with FT.

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

All HA does is monitor the service and use technical means of making sure it stays up as much as possible, by failing the service over to another resource (a DR site, a standby host, a DC on Mars - doesn't matter).
With today's advent of distributed services, FT is what is usually being run, since it does make more sense, but HA is still used for the bulkier services, like VMs or large service nodes (e.g. openstack controllers)

I'm not sure on the wording here. You are using HA as if it is a "type" of protection, but it doesn't mean that in any circle. FT is a specific means to increasing availability, but HA is not.

You can determine FT by looking at the way components failover. If they failover in a specific, non-disruptive way, that is FT. You can only determine HA by looking at the resulting availability of a whole system (where whole system refers to the entire system being evaluated for scope.) FT scope is component, HA scope is system.
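A small sketch of that scope difference. The names, the availability figures, and the baseline are all made up for illustration: FT is recorded as a property of a component, while the HA / SA / LA rating can only be computed from the whole system.

```python
from dataclasses import dataclass
from math import prod

@dataclass
class Component:
    name: str
    availability: float   # assumed figure, for illustration only
    fault_tolerant: bool  # does this component fail over non-disruptively?

def system_availability(components):
    # Every component is needed, so the availabilities multiply.
    return prod(c.availability for c in components)

def rating(availability, standard):
    # HA / SA / LA are judged against an agreed "standard" baseline.
    if availability > standard:
        return "HA"
    if availability < standard:
        return "LA"
    return "SA"

STANDARD = 0.9995  # assumed baseline: one ordinary, decent server

# Hypothetical stack: a genuinely FT server pair built from poor hardware,
# plus the single switch it depends on.
stack = [
    Component("FT server pair", 0.9990, fault_tolerant=True),
    Component("switch", 0.9999, fault_tolerant=False),
]

total = system_availability(stack)
print("FT components:", [c.name for c in stack if c.fault_tolerant])
print("system rating:", rating(total, STANDARD))
```

The server pair stays FT regardless of what the system rating turns out to be. The rating depends only on the resulting availability of everything in scope.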

An example that @NetworkNerd and @Texkonc will remember is NEC's FT servers. They aren't HA, just FT. They use two super crappy servers that normal IT would never even accept to be servers, they are so low in quality and reliability, but then they layer on a hardware FT solution. It's absolutely FT in every way, but a total joke of a product and doesn't meet HA. It's SA, but only barely. It's super silly, but they sell it because it hits the FT checkbox.

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

But it is also incredibly true. You can't "just buy" HA, doesn't work. No product anywhere can do HA if you don't treat it properly or make the things around it support HA as well. It doesn't require "inventing anything", it's just obvious common sense.

I don't see it that way. If you want HA for a service, you pay for a solution to make it HA, that includes all the components that will protect it from various types of failure as well as provide SBA (several layers of SBA if you have the money). But in the end, it all translates to dollars. Your level of paranoia (or the list of stuff you want to protect the service from) vs your budget.

If you isolate the scope dramatically, maybe. But that's pretty weird to do with HA. Most companies want (or need) HA at a service level (they need the resulting service to be HA to the company.)

Example: Our CRM application needs to be able to be used by staff five nines of the time (okay, no one needs a CRM that much but...)

You can't really buy HA for that. You need to encompass not just the storage and app server, but the network it is attached to, the datacenter(s) it is in, the data shuffling between data centers, the generators, the people and processes around all of it, maintenance schedules, and on and on. Theoretically you can buy anything as a service, but in reality, you have to essentially outsource IT, HR, Facilities, Purchasing and maybe some others to get HA for the service as there are so many internal pieces that have to come together to make it all work. The ability to make the system HA simply requires such scope and authorization that a vendor basically can never do it unless that vendor is an IT outsourcer, in which case, we call it "doing" not "buying" because it is the same as internal staff and not productized.
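To put a number on the CRM example, here is a quick calculation of how much downtime a year each "nines" target allows. The only assumption is a 365.25 day year.

```python
# Downtime allowed per year at a given availability target.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(availability):
    return (1.0 - availability) * MINUTES_PER_YEAR

targets = [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]
for label, target in targets:
    print(f"{label}: {downtime_minutes_per_year(target):.1f} minutes/year")
```

Five nines works out to a little over five minutes a year, and that budget is shared by the storage, the app server, the network, the datacenter, the power, and every maintenance window and human process touching the service.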

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

No, the solution provides protection against a hypervisor failure. You also want to protect against switch failure - buy another switch, you also want to protect against storage failure - make the storage HA. oVirt isn't a networking or storage platform, it uses storage and networking. What oVirt does do is run VMs and control hypervisors. So if a VM dies or a hypervisor dies - oVirt will provide the HA.

No, oVirt provides the mechanism, like FT. oVirt cannot guarantee HA. If you are looking at HA as a mechanism, then I can see why this would seem to make sense. But if you look at HA as "high availability", literally availability that is higher than normal, then it is clear that it can't be a mechanism and that any given mechanism can help to achieve it, but can't provide it itself.

So if a VM dies, oVirt will provide non-FT failover, yes. And by having failover you might achieve HA. And oVirt is a critical part of making that possible. But it itself isn't HA, nor does it guarantee HA. It's just a failover component that you can use to "do" HA.

dyasny

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

FT does mean instant "transparent" failover, but it isn't without risk. If it was, Boeing wouldn't be having the issues it has today (literally today.) You are right, HA is about improving A, nothing more. It has to be better than SA (standard A) but doesn't imply FT or anything like FT.

It means transparent failover for VMware, but in general it means the service can tolerate a certain amount of resources failing without having downtime.

I've said many times, a really great vendor hardware agreement can provide HA pretty cheaply in the right circumstances (in Manhattan, HPE has a response time of around 15 minutes to get hardware into datacenters, for real!) That won't work on a farm in Iowa, but it can work in Manhattan.

They won't ship anything until they've done at least a bit of diags anyway, so those 15 minutes aren't really 15 minutes. But that's also nowhere near the topic for this conversation

HA is achieved in any way that results in HA as the final product. FT is a specific type of thing for achieving HA (that's its purpose, better availability) but is very specific in how it tries to do it.

The key difference is...

HA is a concept or term for resulting risk or projected availability. It's just a "rating".
FT is a mechanism or family / class of mechanisms whose intended purpose is to get you to or beyond HA. But if FT doesn't get you HA, and it often doesn't, it doesn't make it not-FT, it just makes it not HA.

Or to put it another way, HA is an end and FT is a means. FT's only goal isn't HA, it is also no service interruption which is not addressed in HA at all. But FT can still fail, and in some cases fail big time, and that can lead to missing HA (or even SA) even with FT.

FT and HA are different techniques achieving different goals. Just like, for example, backup and raid. Yes, in certain situations, FT can replace HA, but that doesn't make one a subset of the other, it makes one a potential replacement for the other, and quite often the replacement isn't the right thing to do.

You can't really buy HA for that. You need to encompass not just the storage and app server, but the network it is attached to, the datacenter(s) it is in, the data shuffling between data centers, the generators, the people and processes around all of it, maintenance schedules, and on and on. Theoretically you can buy anything as a service, but in reality, you have to essentially outsource IT, HR, Facilities, Purchasing and maybe some others to get HA for the service as there are so many internal pieces that have to come together to make it all work. The ability to make the system HA simply requires such scope and authorization that a vendor basically can never do it unless that vendor is an IT outsourcer, in which case, we call it "doing" not "buying" because it is the same as internal staff and not productized.

Well, I call anything you pay for "buying". Unless instead of money I give a goat away for it, then it's "bartering". Now, jokes aside, you need to keep an eye on the scope for the HA and the budget for it. You can, of course, cover every possible corner case failure scenario and buy more tech to protect against it, but you can only go so far. Which means, following your own logic, that nothing you ever buy or "do" will ever give you real true veritable HA, only slightly heightened A, and no more, because there's always the ever fleeting chance of yet another potential disaster that can still cause an outage no matter what you do.

Starting small, you buy a server and install an app. The app has a watchdog to make sure it hasn't crashed and to start it again if it has. That's a very minor HA protecting against a very small scope of failure. You set up raid1 in that server - you just increased the HA by covering disk failure. Keep growing that by small steps, and you have a cluster with redundant everything and DR sites across the universe, everything replicating using an ansible device and whatnot. Whoops, a new big bang happens, wiping out the universe (albeit very slowly), that's an event you still haven't covered, so, following your logic - you never had HA.
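A minimal sketch of that first step, the app watchdog. The `watchdog` function and its `start` callable are hypothetical names for illustration, not part of any product mentioned in this thread. It only covers the "app crashed" scope and does nothing for disk, host, or network failure.

```python
import subprocess
import sys
import time

def watchdog(start, max_restarts=3, delay=1.0):
    """Call start() until it returns exit code 0, restarting on failure.

    start: hypothetical callable that runs the app and returns its exit code.
    Returns the number of restarts that were needed.
    """
    restarts = 0
    while start() != 0:
        if restarts >= max_restarts:
            raise RuntimeError(f"gave up after {restarts} restarts")
        restarts += 1
        time.sleep(delay)  # brief pause before starting the app again
    return restarts

# Demo with a stand-in "app" that simply exits cleanly.
app = [sys.executable, "-c", "raise SystemExit(0)"]
print("restarts needed:", watchdog(lambda: subprocess.call(app)))
```

The restart cap is a design choice so that an app which can never start does not loop forever. A real watchdog would also want logging and alerting.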

dyasny

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

So if a VM dies, oVirt will provide non-FT failover, yes. And by having failover you might achieve HA. And oVirt is a critical part of making that possible. But it itself isn't HA, nor does it guarantee HA. It's just a failover component that you can use to "do" HA.

It's a component you buy (well, this one is opensource, but still), part of a solution that would cover other potential failure points. The solution in general is also something you can buy, in order to achieve a certain level of HA for those VMs