StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far)

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

Exactly. SDS or SANs - you can pick whatever you want and suits you best. But if you are going to be running a replicated distributed storage service on hardware that is already quite busy running VMs, you'll end up in trouble as soon as you go anywhere near capacity.

Maybe you are running into these problems due to bad products, planning or design, but you are hoisting problems you've had onto everyone else where we aren't experiencing these problems.

You are "projecting". I'm not saying you or people you know having implemented HC and had it not work well. But you are confusing a bad implementation or planning with the architecture being bad. The two are different things.

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

@scottalanmiller said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

HCI doesn't hold up in the real world. See, I can keep doing this too

Except it does, tons of places are using it, and show me ANY example where it didn't shine... any. WHile somewhere, one must exist, dollars to donuts you can't find one.

I've worked with several Openstack and K8s clusters where the storage was local to the hypervisors, served from Ceph/Gluster/AFS/SheepDog. Horrible experience each time.

Right, that's my point, doing things badly doesn't mean that design is wrong. Your logic doesn't make sense... no matter how many times you fail at something doesn't mean that the task can't be done. It just means that you did it poorly, or wrong, or something.

Ceph and Gluster are both known to be bad for that, that should have been known going into the project. That someone didn't head you off at the pass shows that the mistakes and oversights were happening early on. We could easily have warned you that that wasn't meant to have good local performance.

So you are proving my point over and over again... you are using clearly wrong tech and looking at failures of one thing and misassociating them with another.

DustinB3403

@scottalanmiller said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

The two are different things.

Like hugging your girlfriend while driving your car.

Both can be done poorly.

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

I'm not talking about simple software RAID, I'm talking about keeping a distributed storage system in sync, and then on top of that keep constantly rebalancing to satisfy the local tiering requirements.

Okay... so thebig question is... since this is not part of HC... why? Stop doing that. You can never have a rational, useful discussion about HC until you talk about HC and not something else.

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

So to skip the rest of your quotes, in general, what you are saying is that a system which is essentially something like a simple KVM with DRBD, is the perfect solution. I am saying sure, for two nodes. How about 200?

No one anywhere recommends 200 nodes in a single cluster. If you think this is a good idea, SAN, SDS, HC, or otherwise, we are on different pages. That's a scale that literally no one, not MS, not StarWind, not VMware, not RedHat recommends as a single failure domain. DO they support it? Yup. Do they think you are crazy? Yup.
Even in the enterprise, which you claim to know, they often use workloads scopes of this size for performance and safety. The larger the pool, the bigger the problems.
If you want to get into giant pools you have to pick your battles. Let's talk reasonable size, like 10-80 nodes. If you need screaming performance that no SAN can match, then you are looking at StarWind network RAID which does that. If you just want cheap pooled storage, you look at CEPH (and there are accelerators for that to make it fast,if you need to). If you want a blend of things you might look at VMware, Scale, or HPE Simplivity depending on your mix of needs. But good solutions exist at pretty much any scale, that meet or exceed the alternatives. Nothing is perfect, as with all things, the goal is to find what is best.

At truly giant scale, the only real benefit to totally external storage, is when speed and reliability are so unimportant that you are willing to sacrifice them to a huge degree to save a few dollars. But with HC's low cost today, even that is approaching the impossible.

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

How exactly can there be no overhead, when you are synchronizing large chunks of blocks over the network? Especially with high replication factors? Add encryption to that, add all the extra logic for local tiering, and there's no wonder the minimal sizing for SDS is so high.

Because you don't have all those things. Large chunks over the network takes like no resources. No idea where you think the overhead comes from, but for most of us, things like copying data is not a high overhead activity. It's a dedicated network in most cases, with offload engines on the NICs, and things like tiering and such take extremely little overhead (if done well.) These just aren't CPU or RAM intensive activities.

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

Do you think AWS/GCP/Azure are running HCI solutions for example?

Those aren't HA, so not applicable to the discussion. HCI is assumed to be replicating to other nodes, something those providers don't provide. They are stand alone compute nodes. Very different animal.

One minute you are assuming tiering, HA, and all kinds of things in your definition of HCI. Then the next you are skipping all of that and talking about players like these. You are jumping around with your definitions.

dyasny

@scottalanmiller said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

Maybe you are running into these problems due to bad products, planning or design, but you are hoisting problems you've had onto everyone else where we aren't experiencing these problems.

You are "projecting". I'm not saying you or people you know having implemented HC and had it not work well. But you are confusing a bad implementation or planning with the architecture being bad. The two are different things.

Or maybe I'm just speaking from experience, and there's plenty of it. Local storage that is not shared is great, but it doesn't scale and pretty much kills all the nice features you can have in a virtualized DC - live migration, HA, all those things you don't care about in SMBs I suppose. Getting that storage replicated in a scalable fashion is hard, simple CBT pushed over network (what the folks at Linbit basically do) does not scale. And hard tasks require resources, those resoutces have to come from somewhere.

Ceph and Gluster are both known to be bad for that, that should have been known going into the project. That someone didn't head you off at the pass shows that the mistakes and oversights were happening early on. We could easily have warned you that that wasn't meant to have good local performance.

Mixing any distributed storage solution with any other workload is known to be bad, this is exactly what I'm saying. I've come into those projects when they were already implemented and got things working by breaking up those overloaded hosts into hardware that was doing one job and doing it well on either side.

Okay... so thebig question is... since this is not part of HC... why? Stop doing that. You can never have a rational, useful discussion about HC until you talk about HC and not something else.

But I am, at least at scale. DRBD and any similar system does not scale. When things are small (SMB level again) this is peanuts, we can do anything because our tasks are smaller than the hardware we can get. What happens at scale though?

No one anywhere recommends 200 nodes in a single cluster. If you think this is a good idea, SAN, SDS, HC, or otherwise, we are on different pages. That's a scale that literally no one, not MS, not StarWind, not VMware, not RedHat recommends as a single failure domain. DO they support it? Yup. Do they think you are crazy? Yup.

200 nodes is small for the scale I typically deal with. Red Hat has solutions that can deal with this kind of scale easily. I know of a few other companies that do. MS, VMW and probably StarWind do not, because of the nature of their clustering implementation, but that's basically all about how you manage locking.

Even in the enterprise, which you claim to know, they often use workloads scopes of this size for performance and safety. The larger the pool, the bigger the problems.

Not really. In a large pool, a dead node simply gets easily replaced. The effect is very small.

If you want to get into giant pools you have to pick your battles.

I usually am in those numbers, but ok

Let's talk reasonable size, like 10-80 nodes. If you need screaming performance that no SAN can match, then you are looking at StarWind network RAID which does that.

OK, so we have a network RAID, a bunch of blocks get streamed to other nodes when writes occur on one. When all there is is pushing block across, things are simple. What happens when a node dies, and I have to suddenly rebalance the data distribution? How is consistency kept? How does the system decide which blocks get streamed where? Even in a 10 node cluster, it would be plain out stupid to keep all the data replicated to everywhere, 10x the data on local disks would be too expensive

If you just want cheap pooled storage, you look at CEPH (and there are accelerators for that to make it fast,if you need to).

Here we have a distributed system, which starts at a least a core per RBD, and 32Gb or RAM to even get started properly. In SMB, I doubt you see many monstrous hypervisors with hundreds of cores, so what is there left to run your actual VMs?

At truly giant scale, the only real benefit to totally external storage, is when speed and reliability are so unimportant that you are willing to sacrifice them to a huge degree to save a few dollars. But with HC's low cost today, even that is approaching the impossible.

Only having a good storage fabric can give you excellent speed and very low latencies, and as for reliability - you can build whatever you want on the SAN side depending on your requirements. The only good thing about HC is local storage access, and it isn't really that far ahead of any decent fabric anyway, if at all.

Because you don't have all those things. Large chunks over the network takes like no resources. No idea where you think the overhead comes from, but for most of us, things like copying data is not a high overhead activity. It's a dedicated network in most cases, with offload engines on the NICs, and things like tiering and such take extremely little overhead (if done well.) These just aren't CPU or RAM intensive activities.

That is simply not true. Pushing large amount of data over the network is not cheap, and that is in the case of simple streaming. When you start running synchronizations and tiering stuff gets harder. And when you have to rebalance (which ceph does often) you need even more resources. Yes, you can dedicate NICs to just that (and those NICs will not be there to provide more bandwidth to the workload traffic) but in order to push large amounts of data into the NICs you also need CPU cycles and and RAM. It's CS 101, there are no free rides.

Those aren't HA, so not applicable to the discussion. HCI is assumed to be replicating to other nodes, something those providers don't provide. They are stand alone compute nodes. Very different animal.

My point exactly. If HC was so great, why wouldn't they be using it?

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

Or maybe I'm just speaking from experience, and there's plenty of it. Local storage that is not shared is great, but it doesn't scale

Right, but we aren't talking about not sharing it. So, again, you are talking about something different. I'm not sure where you are getting lost, but you are talking about totally different things than everyone else. This has nothing to do with the discussion here.

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

Getting that storage replicated in a scalable fashion is hard, simple CBT pushed over network (what the folks at Linbit basically do) does not scale. And hard tasks require resources, those resoutces have to come from somewhere.

Again, didn't scale for you, but your failures do not extend to everyone else. I'm not sure why you feel it can't scale, but it does successfully for others. The common factor here is "your attempts have failed". You have to stop looking at that as a guide to what "can't be done."

That logic is like me claiming that humans can't speak Chinese or do pull ups because I can't do them. Yet anyone can see that a billion or more Chinese people can speak Chinese and billions of healthier than me people can do pull ups. Arguing that they aren't really able to them, even though everyone can see them doing them, because I can't do them, is clearly crazy.

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

Mixing any distributed storage solution with any other workload is known to be bad, this is exactly what I'm saying. I've come into those projects when they were already implemented and got things working by breaking up those overloaded hosts into hardware that was doing one job and doing it well on either side.

Again... you are not understanding that because something can be bad doesn't mean it is always bad. This are basic logical constructs. You are missing the basic logic that absolutely no amount of observation of failures makes other people's observations of success impossible.

You use the logic of "can" to mean "has to". But that logic, our observation of HC working well would mean that HC always works no matter how badly implemented. neither makes the slightest sense.

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

OK, so we have a network RAID, a bunch of blocks get streamed to other nodes when writes occur on one. When all there is is pushing block across, things are simple. What happens when a node dies, and I have to suddenly rebalance the data distribution? How is consistency kept? How does the system decide which blocks get streamed where? Even in a 10 node cluster, it would be plain out stupid to keep all the data replicated to everywhere, 10x the data on local disks would be too expensive

You are assuming automated rebalance. I can't believe that I have to do this again but, that's not part of HC. That's a great feature and useful in some cases, but comes at high cost and is just one of many optional things you might want to do when you have HC. You can't continue to have this discussion until you separate what you have made up to be part of HC and what actually is.

In many cases with network RAID, rebalancing is insanely straightforward, because it goes back to the replacement node. Just like replacing a single RAID drive. It's self explanatory. Could you do something else that is more complex? Sure. Might it be good? Sure. Is it implied? No.

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

Even in the enterprise, which you claim to know, they often use workloads scopes of this size for performance and safety. The larger the pool, the bigger the problems.

Not really. In a large pool, a dead node simply gets easily replaced. The effect is very small.

So you don't understand the pool risks and think that node risks alone exist and that the system as a whole carries no risks? This would explain a lot of the misconceptions around HC. The cluster itself carries risks, it's a single pool of software. Every platform vendor will tell you the same.

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

Only having a good storage fabric can give you excellent speed and very low latencies, and as for reliability - you can build whatever you want on the SAN side depending on your requirements. The only good thing about HC is local storage access, and it isn't really that far ahead of any decent fabric anyway, if at all.

Actually, that breaks the laws of physics. So obviously not true. SAN can't match speed or reliability of non-SAN. That's pure physics. You can't break the laws of math or physicals just by saying so.

HC has EVERY possible advantage of SAN by definition, it just has to there is no way around it, but adds the advantages of reduction in risk points and adds the option of storage locality. Basic logic proves that HC has to be superior. You are constantly arguing demonstrably impossible "facts" as the bases for your conclucsions. But everyone knows that that's impossible.

scottalanmiller

basically we are having a time warp discussion back to 2007 when almost everyone truly believed that SANs actually were magic and did things that could not be explained or done without the label "SAN" involved and that physics or logic didn't apply. But in 2007, it was considered normal to think of a SAN as being magic. In the twelve years since, this same discussion has been had hundreds or thousands of times. The basics haven't changed. But finally people have started to realize that SANs are not magic at all, but just a server with drives in it, often doing an awful job with high failure rates and ridiculous costs. The world has moved on and nearly everyone understands this now. But this discussion is going right back to 2007 and acting as if SANs were still believed to be magic and that over a decade of general storage understanding hasn't happened.

I get it, storage can be confusing. But arguing against 15+ years of information that is well established and just acting like it hasn't happened and just reiterating the myths that have been solidly debunked and ignoring that this is all well covered ground just makes it seem crazy.

dyasny

@scottalanmiller said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

Right, but we aren't talking about not sharing it. So, again, you are talking about something different. I'm not sure where you are getting lost, but you are talking about totally different things than everyone else. This has nothing to do with the discussion here.

Good, then we are at least partially on the same page

Again, didn't scale for you, but your failures do not extend to everyone else. I'm not sure why you feel it can't scale, but it does successfully for others. The common factor here is "your attempts have failed". You have to stop looking at that as a guide to what "can't be done."

OK, what is the largest cluster size you can run with this starwind solution reliably? Their best practice doc is very careful about mentioning scale being a problem, although they call it "inconveniences".

Again... you are not understanding that because something can be bad doesn't mean it is always bad. This are basic logical constructs. You are missing the basic logic that absolutely no amount of observation of failures makes other people's observations of success impossible.

I am talking about a very basic thing - storage tasks require resources. Those resources need to come from somewhere. If you don't use dedicated boxes, you have to take resources away from your VMs. It is extremely simple.

You are assuming automated rebalance.

Automated or manually triggered - it's a costly operation. Even if you don't run a sync cycle but do a dumb data stream from a quiesced source, you will be pushing lots of data over several layers of hardware and protocol, that does not come for free. When you replace a disk in a RAID array, you are going to suffer from performance degradation until the raid is in sync, because the hardware or software raid system will be working hard to push all the missing data to the new disk in the best case scenario, and will be generating a ton of parity and hashes in the worst. This does not come cheap.

So you don't understand the pool risks and think that node risks alone exist and that the system as a whole carries no risks? This would explain a lot of the misconceptions around HC. The cluster itself carries risks, it's a single pool of software. Every platform vendor will tell you the same.

I understand the risks, and losing just a storage node or just a hypervisor node is much less risk than losing both at once. I was hoping you would understand that, but I guess I shouldn't hope.

Actually, that breaks the laws of physics. So obviously not true. SAN can't match speed or reliability of non-SAN. That's pure physics. You can't break the laws of math or physicals just by saying so.

Really? FC at lightspeed from a couple of yards away is significantly slower than local disk traffic? Are you sure we have the same physics in mind?

HC has EVERY possible advantage of SAN by definition, it just has to there is no way around it, but adds the advantages of reduction in risk points and adds the option of storage locality. Basic logic proves that HC has to be superior. You are constantly arguing demonstrably impossible "facts" as the bases for your conclucsions. But everyone knows that that's impossible.

You keep talking about your assumptions as if they are the one and only possible truth. They are not. HC cannot have the advantages of a SAN because the SAN is more than just a big JBOD (and even if it were, it has the advantage of being a much larger JBOD than you could ever hope to build on a single commodity server). A SAN has tons of added functionality which it deals with without loading the hosts. If you start implementing all of that in HC, you end up spending even more local host resources on non-workload needs. So either your "basic logic" is flawed, or you simply aren't able to accept that there might be points of view besides yours.

basically we are having a time warp discussion back to 2007 when almost everyone truly believed that SANs actually were magic and did things that could not be explained or done without the label "SAN" involved and that physics or logic didn't apply.

I'm not the one talking about "magic sauce" here, remember? I am actually talking about implementation specifics and how they are not simple (because I know these details and technologies well enough to discuss them and see no magic in them)

I get it, storage can be confusing. But arguing against 15+ years of information that is well established and just acting like it hasn't happened and just reiterating the myths that have been solidly debunked and ignoring that this is all well covered ground just makes it seem crazy.

Have you noticed how you never have any real arguments instead what I see is "this is known for N years!" and "this is the one and only logic!"? I get it, being defied with solid technical arguments can be confusing, but please try to bear with me here instead of just defaulting the the usual non-arguments. Can you explain to me, how keeping large storage volumes synchronized over a network has no overhead and consumes no host resources please? It's a simple question, and I will not accept "magic" as an answer. Saying that pushing large amounts of data across a network comes at no cost is pretty much defying the laws of physics, so I'd like to know how exactly you expect to circumvent them.

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

OK, what is the largest cluster size you can run with this starwind solution reliably? Their best practice doc is very careful about mentioning scale being a problem, although they call it "inconveniences".

Bigger than the platforms suggest to go. Starwind has no limits. Its' Vmware, KVM, Hyper-V, etc. that have the limits. Because of how Starwind works, they don't carry the single domain scaling limits in their own stuff. So anything you are looking at with any architecture would have the same or smaller limits, literally anything.

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

I get it, being defied with solid technical arguments can be confusing

Fifteen years of providing those, and you are just ignoring them. I get it, but you can't just act like we've not established this all already. You aren't refuting the tech details, you are acting like they haven't been established for forever.

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

I am talking about a very basic thing - storage tasks require resources. Those resources need to come from somewhere. If you don't use dedicated boxes, you have to take resources away from your VMs. It is extremely simple.

But not more resources, or meaningfully more, than the resources needed to pass that off elsewhere. And the cost of doing that has to come out of money that could be spent on local performance.

The way that you word this makes it sound reasonable, but it's not actually necessarily true.

Storage resources aren't very big in most cases, to a point where addressing them is foolish. You are literally arguing that software RAID uses a lot of resources, when since 2000, it's accepted that it doesn't. You need to provide details as to why you feel the entire industry has for two decades accepted and understood and demonstrated one thing isn't correct and that you have some secret information about how all of that is somehow wrong.

You need to show where this overhead that no one else sees is coming from. What is creating it? Why can't anyone else see it? Why are only you affected? Why is it a big deal if we can't see it or measure it? Why have all studies for twenty years showed one thing, and where is your evidence that it is all wrong?

The logic that storage overhead comes from the VMs is the same logic that mislead people about hardware RAID... those resources have to come from the CPU. That's true. But it's also the wrong way to think about performance. What matters is resulting performance of the VMs or workloads. You are getting under the hood and missing the big picture. With hardware RAID what was found is that the overhead to do it with the more powerful central CPU in essentially all cases was so low that the performance advantage went to software RAID because it was a tiny bit faster, with nominal overhead that was "spare", and in the extremely rare case that it was not, it was vastly cheaper to increase the central CPU and/or RAM capacity than it was to purchase hardware RAID for offloading.

Consider that "storage must have overhead" and "overhead must come out of the VMs" are true statements that feel like they must result in "therefore that slows the VMs down" when we know that that is not the necessary answer. It's a possible answer, but those facts aren't a complete picture and are meaningless on their own. And the industry has proven this for decades with solid math, logic, and provable observation. This isn't some crazy theory, this is IT basics as has been known across the board since the Pentium IIIS processors were released and showed it to be true and has only become more common in the decades since.

scottalanmiller

@dyasny said in StarWind HCA is one of the 10 coolest HCI systems of 2019 (so far):

You are assuming automated rebalance.

Automated or manually triggered - it's a costly operation.

yes, but its' a cost that has to exist regardless. So in this context, it has no extra cost.