New Infrastructure to Replace Scale Cluster

Dashrender

@dyasny said in New Infrastructure to Replace Scale Cluster:

@scottalanmiller you're obviously not inventing the term HA, but you are inventing this ridiculous saying about HA being what you do and not what you buy. You can, of course, hack HA into almost any service, but a product that is already built with HA in mind is something you buy and use as designed - and you get HA. Out of the box, if you bought and configured all the prerequisites. oVirt, vCenter, and a ton of other products have it designed into them, so if you pay for it, and for the hardware that supports it, you can have it right there out of the box, if you follow the setup guide. Everything else is just you throwing meaningless pronouncements in the air.

You buy VMware, with the HA features (don't remember if those cost extra, doesn't matter here). You buy hardware that supports whatever VMware uses for HA (IPMI/Redfish/redundant switches etc - whatever is the best practice) and you follow the config guide to set it up - you have yourself highly available VMs, with all the standard properties for HA - downtime SLA, split-brain avoidance etc etc. These are features you pay for (that's what "buy" means in the English language), both on the software and hardware side of things.

And yes, I've decided arguing with you here is a huge waste of time, because for every comment you come back with 10, and I have no bandwidth for replying to that much, so if you think you "won" an argument or whatever tickles your fancy, sure, go ahead. I'll just answer if I want to, at my own convenience. Hope you don't mind.

In general, "buying" HA is thought of as just buying VMware with the HA feature and, ta-da, you have HA, which of course is wrong. If you don't have redundant power and redundant switches, and storage, and HVAC, etc, etc - then you don't have real HA - you have one tiny piece that's HA, but you don't have HA for the likely end goal.

At least this is what I take from Scott's comments.

VMware won't sell you the solution that includes all the switches and internet and power and HVAC, etc, etc. They will only sell you the things they sell. But HA requires so much more than they can provide.

dyasny

@Dashrender said in New Infrastructure to Replace Scale Cluster:

In general, "buying" HA is thought of as just buying VMware with the HA feature and, ta-da, you have HA, which of course is wrong. If you don't have redundant power and redundant switches, and storage, and HVAC, etc, etc - then you don't have real HA - you have one tiny piece that's HA, but you don't have HA for the likely end goal.

Everything you mentioned is something you can buy, and is usually specified as a prerequisite for an HA solution.

At least this is what I take from Scott's comments.

VMware won't sell you the solution that includes all the switches and internet and power and HVAC, etc, etc. They will only sell you the things they sell. But HA requires so much more than they can provide.

I frankly don't remember the last time I actually bought anything from a vendor. People usually go to an integrator, and pay for building a solution for them. That integrator should provide all the prerequisites, and you pay for a solution not a single standalone product.

Everyone wants to sell solutions, not products, to the point where these solutions in turn become productized. And an HA solution can be bought.

Now, if we stop the pissing contest (yes Scott, you are the god of IT, you are always right and everyone else is always wrong, even when you don't drop your 2c in every conversation) and think rationally for a second here, we are talking about a small setup with just a few hosts. We already established we have brand name hardware, which means IPMI can be used as SBA. Out of the box with oVirt, this provides us with the ability to make VMs highly available, just configure the hosts, and mark the VMs as HA, nothing else. Assuming networking was done properly, and we have switch redundancy, and hoping the building has a generator and the storage is also reliable is nice, but out of scope for this particular scenario. All the OP wanted was VM HA, where if a VM dies or the host carrying it stops being able to host it, it gets safely started elsewhere. It isn't hard to grasp the concept, really.

Dashrender

@dyasny said in New Infrastructure to Replace Scale Cluster:

@Dashrender said in New Infrastructure to Replace Scale Cluster:

In general, "buying" HA is thought of as just buying VMware with the HA feature and, ta-da, you have HA, which of course is wrong. If you don't have redundant power and redundant switches, and storage, and HVAC, etc, etc - then you don't have real HA - you have one tiny piece that's HA, but you don't have HA for the likely end goal.

Everything you mentioned is something you can buy, and is usually specified as a prerequisite for an HA solution.

At least this is what I take from Scott's comments.

VMware won't sell you the solution that includes all the switches and internet and power and HVAC, etc, etc. They will only sell you the things they sell. But HA requires so much more than they can provide.

I frankly don't remember the last time I actually bought anything from a vendor. People usually go to an integrator, and pay for building a solution for them. That integrator should provide all the prerequisites, and you pay for a solution not a single standalone product.

Everyone wants to sell solutions, not products, to the point where these solutions in turn become productized. And an HA solution can be bought.

Now, if we stop the pissing contest (yes Scott, you are the god of IT, you are always right and everyone else is always wrong, even when you don't drop your 2c in every conversation) and think rationally for a second here, we are talking about a small setup with just a few hosts. We already established we have brand name hardware, which means IPMI can be used as SBA. Out of the box with oVirt, this provides us with the ability to make VMs highly available, just configure the hosts, and mark the VMs as HA, nothing else. Assuming networking was done properly, and we have switch redundancy, and hoping the building has a generator and the storage is also reliable is nice, but out of scope for this particular scenario. All the OP wanted was VM HA, where if a VM dies or the host carrying it stops being able to host it, it gets safely started elsewhere. It isn't hard to grasp the concept, really.

Started? That's not HA. At least at this very moment I wouldn't consider it HA, I'd consider it SA (Standard Availability).

As for your reasoning that people buy solutions - oh, if only that were true more often than not. But one look at Spice--- oh you know that place, you can see that people buy a SAN and think they have HA. period.

Also to more points you made - You simply can't have HA without having ALL of those other parts. It's great that you have HA at the server level, but what are the chances that's where your issue is going to be and not at the electrical power level? It's great that the servers have HA, but if your internet doesn't is the solution really HA?
No - it's not.

It's pseudo HA, or more likely simply SA, or in the worst setup, LA.

dyasny

@Dashrender said in New Infrastructure to Replace Scale Cluster:

Started? That's not HA. At least at this very moment I wouldn't consider it HA, I'd consider it SA (Standard Availability).

It doesn't matter what you consider. Automatic monitoring of a service and making sure it always runs (so if it stops running, the system starts it) is the definition of HA. Nobody ever promised 0 downtime, just a lot of 9's after the dot. Don't confuse HA with FT

As for your reasoning that people buy solutions - oh, if only that were true more often than not. But one look at Spice--- oh you know that place, you can see that people buy a SAN and think they have HA. period.

I don't care what people think, there are standards and definitions available.

Also to more points you made - You simply can't have HA without having ALL of those other parts. It's great that you have HA at the server level, but what are the chances that's where your issue is going to be and not at the electrical power level? It's great that the servers have HA, but if your internet doesn't is the solution really HA?
No - it's not.

HA is about avoiding a failure. The best HA solutions target all possible points of failure, and how many of those you want to cover is up to you and your budget. No solution is ever perfect, even the solutions that are meant to address solution imperfections

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

It doesn't matter what you consider. Automatic monitoring of a service and making sure it always runs (so if it stops running, the system starts it) is the definition of HA. Nobody ever promised 0 downtime, just a lot of 9's after the dot. Don't confuse HA with FT

That might be a way to achieve HA, but it isn't the definition of it. The definition of it comes from the level of availability, nothing to do with the mechanisms or attempts at it. You are correct that people often confuse HA (a level of availability) with FT (a physical type of protection whose purpose is to help provide HA, or at least "better A".)
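For anyone following along, the "lot of 9's" mentioned above can be turned into allowed downtime with simple arithmetic. This is only a rough sketch: the function name is made up for illustration, and the percentages in the loop are common examples, not figures anyone in this thread committed to.

```python
# Convert an availability percentage into allowed downtime per year.
# Assumes a 365-day year; the function name is hypothetical.
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the minutes of downtime per year implied by an availability level."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Illustrative availability levels only.
    for pct in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% available -> about {minutes:.1f} minutes of downtime per year")
```

Seen this way, availability is a measured or projected number for the whole service, whatever mechanism is used to reach it.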

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

you're obviously not inventing the term HA, but you are inventing this ridiculous saying about HA being what you do and not what you buy.

No, I've clearly quoted the source on that every time. It's @StorageNinja

But it is also incredibly true. You can't "just buy" HA, doesn't work. No product anywhere can do HA if you don't treat it properly or make the things around it support HA as well. It doesn't require "inventing anything", it's just obvious common sense.

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

Everyone wants to sell solutions, not products, to the point where these solutions in turn become productized. And an HA solution can be bought.

Sure, but that is 1) almost always untrue, very few vendors actually do anything to achieve HA 2) when they do, they are becoming the IT department and building, not buying, HA.

Basically what you are doing is agreeing with John, but playing a services semantics game to look at it from a CEO's perspective that he can "buy an IT department that will be tasked with implementing HA". So if you consider your staffing, hiring, and business decisions to be "something you can buy like a product", then sure. But all you've done is define "anything you do" as something purchasable.

DustinB3403

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

But all you've done is define "anything you do" as something purchasable.

To be the devil's advocate

So you can pay someone to breathe for you? To flush the toilet for you? To wipe for you?

JaredBusch

@DustinB3403 said in New Infrastructure to Replace Scale Cluster:

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

But all you've done is define "anything you do" as something purchasable.

To be the devil's advocate

So you can pay someone to breathe for you? To flush the toilet for you? To wipe for you?

Yes, yes, and yes.

DustinB3403

@JaredBusch said in New Infrastructure to Replace Scale Cluster:

@DustinB3403 said in New Infrastructure to Replace Scale Cluster:

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

But all you've done is define "anything you do" as something purchasable.

To be the devil's advocate

So you can pay someone to breathe for you? To flush the toilet for you? To wipe for you?

Yes, yes, and yes.

Well now I know where my paycheck is going to!

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

Out of the box with oVirt, this provides us with the ability to make VMs highly available, just configure the hosts, and mark the VMs as HA, nothing else. Assuming networking was done properly, and we have switch redundancy, and hoping the building has a generator and the storage is also reliable is nice, but out of scope for this particular scenario. All the OP wanted was VM HA, where if a VM dies or the host carrying it stops being able to host it, it gets safely started elsewhere. It isn't hard to grasp the concept, really.

See, this is where it all falls apart. Your example proves our point. In your example with the IPOD, you aren't even trying to make HA. oVirt's HA option is actually terrible here because it tricks the humans into thinking there is protection where there is not and makes people often (where makes = allows their brains to accept) introduce more risk, lowering their availability below standard, by seeing the term HA applied to one isolated layer of the stack and ignoring the increased risk of the overall stack.

It's not just an example of how HA fails, it is THE example that we've been talking about for a decade. It's the "buy the book" worst possible setup where you add lots of hardware, add "HA products", and the result is a system that is slower and more fragile than if you hadn't done it at all. It's the exact scenario we did the risk analysis on a few years ago, the one that has been beaten to death.

If you feel the math is wrong, go after that. But you are just repeating the identical arguments that people have always made and doing the same "not looking where the problem is", instead looking at the term HA and ignoring that to turn it on you had to create a huge amount of risk that wasn't there originally. It's all the standard marketing trick.

There is no pissing match; these are tried and true, obvious, and well-known problems with HA marketing. It's an example that trivially proves the point (and has, SO many times.) And the only thing going on is you are making a fresh argument for something that was long ago shown to be LA and acting like we don't all already know this. Your statements about it suggest that you either haven't read about it and haven't examined the risk in whole, or you are knowingly ignoring the body of work on it and thinking that if you pretend the evidence hasn't been presented that rehashing it will cause us to forget where the risk is.

Why you are arguing the point in this way doesn't make sense. Because you aren't stating where the math is wrong, you are just ignoring that math has been presented over and over again and no one in a decade has ever ended up disagreeing with the evidence. So it is not how to present a logical argument.

scottalanmiller

@DustinB3403 said in New Infrastructure to Replace Scale Cluster:

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

But all you've done is define "anything you do" as something purchasable.

To be the devil's advocate

So you can pay someone to breathe for you? To flush the toilet for you? To wipe for you?

And you can then call it something "you buy" rather than "something you do".

scottalanmiller

@DustinB3403 said in New Infrastructure to Replace Scale Cluster:

@JaredBusch said in New Infrastructure to Replace Scale Cluster:

@DustinB3403 said in New Infrastructure to Replace Scale Cluster:

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

But all you've done is define "anything you do" as something purchasable.

To be the devil's advocate

So you can pay someone to breathe for you? To flush the toilet for you? To wipe for you?

Yes, yes, and yes.

Well now I know where my paycheck is going to!

To be used as toilet paper?

DustinB3403

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

@DustinB3403 said in New Infrastructure to Replace Scale Cluster:

@JaredBusch said in New Infrastructure to Replace Scale Cluster:

@DustinB3403 said in New Infrastructure to Replace Scale Cluster:

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

But all you've done is define "anything you do" as something purchasable.

To be the devil's advocate

So you can pay someone to breathe for you? To flush the toilet for you? To wipe for you?

Yes, yes, and yes.

Well now I know where my paycheck is going to!

To be used as toilet paper?

To pay people to do things that I do myself!

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

I frankly don't remember the last time I actually bought anything from a vendor. People usually go to an integrator, and pay for building a solution for them.

Integrator is industry speak for the vendor advocate. When anyone in IT says vendor, they mean integrator. It's just accepted that they are the channel arm for their vendors and to the IT side they are one and the same. Both are vendor advocates, both are sales people, one just repeats the marketing of the other.

Dashrender

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

@DustinB3403 said in New Infrastructure to Replace Scale Cluster:

@JaredBusch said in New Infrastructure to Replace Scale Cluster:

@DustinB3403 said in New Infrastructure to Replace Scale Cluster:

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

But all you've done is define "anything you do" as something purchasable.

To be the devil's advocate

So you can pay someone to breathe for you? To flush the toilet for you? To wipe for you?

Yes, yes, and yes.

Well now I know where my paycheck is going to!

To be used as toilet paper?

LOL - nice!!
I see what you did there.

scottalanmiller

The marketing trick being used here is "shifted risk." It's a "sleight of hand" way to make something look like HA, and it is the most common approach used to sell "HA equipment" that isn't actually HA. It's popular because it provides an affordable setup, at great margins, and high cost, but not so high as to break the customer's bank. But it costs quite a bit more than a true HA setup, and so is popular as a way for vendors (and integrators) to raise the costs without raising them beyond acceptable limits.

The trick is that they identify the risks of the initial system. They call it "the app server" to begin the trick. In a single server setup, this isn't correct, the single server is the "entire physical system", you can't isolate one part of it like that, but they do and hence the trick begins.

Next they say "how do we eliminate this risk"? Answer: make it redundant. Another sleight of hand: redundant means "an extra thing", not "it is fault tolerant", but the average person thinks that they must mean the latter and hears that in their head regardless of what is actually said. Of course this is wrong, because the risk is of the entire system, not just one piece, but they only offered to make one piece redundant. In theory, everyone should know that the thing is a trick at this point, but people always want to give the benefit of the doubt because hearing that "HA can be bought" triggers an emotional reaction and makes us hope that it is really so easy and automatic that it requires nothing from us.

Next the big risks, the storage and other system components are shifted out of the single box and put somewhere else with a "no need to look over there, it's magic" attitude and people happily agree that since storage is hard, they won't look any further. Every integrator pulls this trick, this is where the money is. So people assume that since everyone does it, it must be right.

Then the integrator adds in more risk by needing switches. But they, again, simply raise the cost by saying it needs to be redundant and since it is redundant, ignore that it carries risks too.

The end result is the integrator gets to sell not just the two servers the customer didn't need (because if they bought this, they clearly didn't need HA at all), but also a third server, and two extra switches, and a lot of set up man power. The vendor piece of the equation is thrilled because they easily double (or more) their hardware sales. And the integrator piece is thrilled because they didn't just double their margins but they also use the unneeded complexity to push for a lot of integrator hours and ongoing support because they made something that was best left simple, into something really complex.

And all of this by somehow convincing people that "risk isn't additive", and it is amazing that people can be tricked into believing that. The storage server, at the end of the day, carries literally all of the risk of the original server and isn't removed from the risk pool in any way. If we had kept our eye on the storage component of the original setup instead of being tricked by being directed to the application piece, it all becomes really obvious, really quickly. That's how the card trick is played - look at this motion while I hide the card you were trying to watch.

Then, once we take our eyes off of the "unaddressed initial risk", all of the other risk, the application servers, the switches, the cabling... is all "extra risk layered on top of the unaddressed initial risk." And math is math. More risk is riskier than less risk. It's literally that simple.
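To put rough numbers on the "risk is additive" point, here is a minimal sketch using the usual series/parallel availability arithmetic. Everything in it is an assumption for illustration: the 0.99 per-component figure is made up, failures are treated as independent, failover is treated as perfect, and the storage box is assumed to be exactly as reliable as the original single server. It is not the risk analysis referenced in this thread.

```python
from math import prod

# Hypothetical availability of any one box (server, switch, or storage node).
A = 0.99


def series(*parts: float) -> float:
    """Every part must be up for the system to be up, so availabilities multiply."""
    return prod(parts)


def redundant_pair(a: float) -> float:
    """Two independent copies; the pair is down only when both are down."""
    return 1 - (1 - a) ** 2


# Original setup: one server holding compute and storage together.
single_server = A

# Shifted-risk design: redundant app servers, redundant switches,
# and one shared storage box that everything depends on.
shifted_risk_design = series(
    redundant_pair(A),  # application servers
    redundant_pair(A),  # switches
    A,                  # single storage node, still a single point of failure
)

if __name__ == "__main__":
    print(f"single server:       {single_server:.6f}")
    print(f"shifted-risk design: {shifted_risk_design:.6f}")
```

Under these assumptions the redundant layers can only multiply the storage node's availability by something below 1, so the whole stack ends up slightly less available than the single server it replaced, even before counting the extra software and cabling.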

scottalanmiller

This IPOD setup, and the ~~vendors~~ integrators that sell it we called "the standard scam of SMB IT" a year ago.

Youtube Video

dyasny

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

That might be a way to achieve HA, but it isn't the definition of it. The definition of it comes from the level of availability, nothing to do with the mechanisms or attempts at it. You are correct that people often confuse HA (a level of availability) with FT (a physical type of protection whose purpose is to help provide HA, or at least "better A".)

No, FT is about zero service interruptions, you're basically running multiple copies of the same service in either standby or load-balanced manner, and if one of a few (depends on the SLA) service instances go down, overall service availability is not harmed. With HA, the service availability is higher than with none, but it's not meant to be zero downtime in case of failure. All HA does is monitor the service and use technical means of making sure it stays up as much as possible, by failing the service over to another resource (a DR site, a standby host, a DC on Mars - doesn't matter).

With today's advent of distributed services, FT is what is usually being run, since it does make more sense, but HA is still used for the bulkier services, like VMs or large service nodes (e.g. OpenStack controllers).

But it is also incredibly true. You can't "just buy" HA, doesn't work. No product anywhere can do HA if you don't treat it properly or make the things around it support HA as well. It doesn't require "inventing anything", it's just obvious common sense.

I don't see it that way. If you want HA for a service, you pay for a solution to make it HA, that includes all the components that will protect it from various types of failure as well as provide SBA (several layers of SBA if you have the money). But in the end, it all translates to dollars. Your level of paranoia (or the list of stuff you want to protect the service from) vs your budget.

Sure, but that is 1) almost always untrue, very few vendors actually do anything to achieve HA 2) when they do, they are becoming the IT department and building, not buying, HA.

All the vendors dealing with IT infrastructure have tons of HA oriented solutions. From hardware vendors selling RAID controllers, power management devices and redundant PSUs, to software vendors building support for HA into their products. I find it hard to think of any IT product I've used recently that didn't have HA or FT built in.
That doesn't make any sense.

See, this is where it all falls apart. Your example proves our point. In your example with the IPOD, you aren't even trying to make HA. oVirt's HA option is actually terrible here because it tricks the humans into thinking there is protection where there is not and makes people often (where makes = allows their brains to accept) introduce more risk, lowering their availability below standard, by seeing the term HA applied to one isolated layer of the stack and ignoring the increased risk of the overall stack.

No, the solution provides protection against a hypervisor failure. You also want to protect against switch failure - buy another switch, you also want to protect against storage failure - make the storage HA. oVirt isn't a networking or storage platform, it uses storage and networking. What oVirt does do is run VMs and control hypervisors. So if a VM dies or a hypervisor dies - oVirt will provide the HA. If all you have are three hosts, what I advised on isn't the safest solution, but it is the easiest to build and run, as well as expand upon later, when budget is available for covering the SPOFs in the setup. It will also provide better performance than using gluster, and there might be more advantages in terms of disk space availability, depending on the hardware the OP has.

I'm not arguing it is safer this way, I'm saying it is easier to build and reasonably safe if backups are done, especially if there is more budget to come in later in the game. Something to get started with.

The user has to be aware of the SPOFs in the system, I'm not saying there are none. The user also has to work at eliminating them, but again, you're applying the logic of a large company with large resources to a tiny little shop with 3 machines. At this level, they might as well just run everything locally on 3 disparate hosts and be done, that's even easier. The point of oVirt here is to start a proper virtual DC from some small set of available boxes and grow it into a proper solution. Nothing involving 3 hosts and a switch can provide real HA in any case.

So please stop running away into depths that are irrelevant to this particular setup. We already know you're an IT bigshot, there's no need to keep showing off.

Integrator is industry speak for the vendor advocate. When anyone in IT says vendor, they mean integrator. It's just accepted that they are the channel arm for their vendors and to the IT side they are one and the same. Both are vendor advocates, both are sales people, one just repeats the marketing of the other.

Sure, but that integrator will do everything for you - networking, HVAC, servers, storage, software - all designed and built as per your requirements, as one single product - your DC.

The marketing trick ...

So much text with nothing actually said. That's actually a marketing trick. Or a politician's trick, whatever.

When you deal with an integrator/vendor/consultant/etc you go over the proposed solution, and it is up to you to see where you're being oversold on stuff and where the proposed parts of the system are actually needed. If you haven't done your due diligence - it's your fault and you should leave the profession, instead of blaming "marketing tricks". All this talk of marketing tricks is basically an attempt at shifting the blame for not having done proper due diligence when signing the purchase contract, nothing more.

When I did freelance DC design, I always pointed out the specific points which could be an SPOF and agreed with the customer on whether they want and can afford to address them or not, instead of just trying to milk them for money. That usually got my overall solution prices to be much lower than what large vendors and integrators offered. I ended up making less money, but having loyal customers who trusted me and kept coming back. If I just came in and started pushing a small enterprise into building wall street level solutions, I'd have been kicked out of the building. But when the solution was up, with known, budget related issues, we always had a plan for the next few years to address these issues, as well as a backup plan to protect the really important aspects of the business. In a few years, gradually, the issues got addressed and the setup grew more reliable, as budget permitted. Had I come in demanding everything was covered from the start, there wouldn't have been any setup to grow and improve in the first place.

Again, I'm not saying there is perfect HA in the solution, I'm not saying it's totally reliable. I'm saying it's a start with some protection against known failures, and a set of SPOFs to be aware of and address in the future. I'm not trying to sell any false promises. For a small business with 3 servers, this is as much as they can hope for anyway.

If you eliminate the storage SPOF, you still have a network SPOF in place, and you're running a badly performing storage system, driving the amount of network traffic up as well as tripling the storage consumption. And you're adding a lot more software into the mix, software that can break, have bugs etc etc etc. All of that without actually having the perfect HA - there's no redundant networking, no redundant HVAC and no DR site. OMG, we're all gonna die!

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

No, FT is about zero service interruptions, you're basically running multiple copies of the same service in either standby or load-balanced manner, and if one of a few (depends on the SLA) service instances go down, overall service availability is not harmed. With HA, the service availability is higher than with none, but it's not meant to be zero downtime in case of failure. All HA does is monitor the service and use technical means of making sure it stays up as much as possible, by failing the service over to another resource (a DR site, a standby host, a DC on Mars - doesn't matter).

FT does mean instant "transparent" failover, but it isn't without risk. If it was, Boeing wouldn't be having the issues it has today (literally today.) You are right, HA is about improving A, nothing more. It has to be better than SA (standard A) but doesn't imply FT or anything like FT.

I've said many times, a really great vendor hardware agreement can provide HA pretty cheaply in the right circumstances (in Manhattan, HPE has a response time of around 15 minutes to get hardware into datacenters, for real!) That won't work on a farm in Iowa, but it can work in Manhattan.
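A quick way to see why repair time matters so much is the standard steady-state formula, availability = MTBF / (MTBF + MTTR). The sketch below is only illustrative: the MTBF and both repair times are invented numbers, not HPE figures or measurements from anyone's environment.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical: the same server, failing about once a year on average.
MTBF_HOURS = 8760.0

fast_repair = availability(MTBF_HOURS, mttr_hours=1.0)    # parts on site very quickly
slow_repair = availability(MTBF_HOURS, mttr_hours=24.0)   # next-day parts

if __name__ == "__main__":
    print(f"1 hour to repair:   {fast_repair:.5%}")
    print(f"24 hours to repair: {slow_repair:.5%}")
```

Same hardware, same failure rate; only the time to repair changes, and the resulting availability moves with it.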

HA is achieved in any way that results in HA as the final product. FT is a specific type of thing for achieving HA (that's its purpose, better availability) but is very specific in how it tries to do it.

The key difference is...

HA is a concept or term for resulting risk or projected availability. It's just a "rating".
FT is a mechanism or family / class of mechanisms whose intended purpose is to get you to or beyond HA. But if FT doesn't get you HA, and it often doesn't, it doesn't make it not-FT, it just makes it not HA.

Or to put it another way, HA is an end and FT is a means. FT's only goal isn't HA, it is also no service interruption which is not addressed in HA at all. But FT can still fail, and in some cases fail big time, and that can lead to missing HA (or even SA) even with FT.