New Infrastructure to Replace Scale Cluster

DustinB3403

@dyasny so to clarify for me, as I'm fighting a headache.
This design is similar to that of ESXi with vSphere.

In that you should have 3 physical hosts, 1 of which is installed with the vSphere service.

Correct?

dyasny

@DustinB3403 no, in this particular setup, you have two options. The original one would be to go hyperconverged, installing both the storage and hypervisor services on all 3 hosts, and to also deploy the engine (vSphere equivalent) as a VM in the setup (that's called a self-hosted engine).

The better option, IMO, is to use two hosts as hypervisors, pack the third with disks, and use it as the storage device (NFS or iSCSI). Also install the engine on it, as a VM or on bare metal - doesn't matter.

You will have fewer hypervisors, true, but having a storage service on the hypervisors is a resource drain, so you don't actually lose as much in terms of resources. And you gain a proper storage server, less management headache, and a setup that can scale nicely if you decide to add hypervisors or buy a real SAN. Performance will also be better, and you might even end up with more available disk space, because you will not have to keep 3 replicas of every byte like Gluster/Ceph require you to do.
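A rough sketch of the disk space point, with made-up numbers. The disk counts and sizes, the RAID 6 layout on the storage server, and the function names are all assumptions for illustration, not anything from the actual hardware in this thread:

```python
def usable_replicated_tb(raw_per_host_tb: float, hosts: int, replicas: int) -> float:
    """Usable capacity when every byte is stored `replicas` times across the hosts."""
    return raw_per_host_tb * hosts / replicas


def usable_raid6_tb(disk_tb: float, disks: int) -> float:
    """Usable capacity of a single RAID 6 array (two disks' worth of parity)."""
    if disks < 4:
        raise ValueError("RAID 6 needs at least 4 disks")
    return disk_tb * (disks - 2)


# Hypothetical hardware: 12 disks of 4 TB in total.
# Hyperconverged: 4 disks (16 TB raw) in each of 3 hosts, replica count 3.
hyperconverged = usable_replicated_tb(raw_per_host_tb=16, hosts=3, replicas=3)

# Dedicated storage server: all 12 disks in one RAID 6 array (assumed layout).
storage_server = usable_raid6_tb(disk_tb=4, disks=12)

print(f"hyperconverged, replica 3: {hyperconverged} TB usable")
print(f"storage server, RAID 6:    {storage_server} TB usable")
```

This only compares raw capacity. It says nothing about what happens when the single storage server itself goes down.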

Dashrender

@dyasny said in New Infrastructure to Replace Scale Cluster:

@DustinB3403 no, in this particular setup, you have two options. The original one would be to go hyperconverged, installing both the storage and hypervisor services on all 3 hosts, and to also deploy the engine (vSphere equivalent) as a VM in the setup (that's called a self-hosted engine).

The better option, IMO, is to use two hosts as hypervisors, pack the third with disks, and use it as the storage device (NFS or iSCSI). Also install the engine on it, as a VM or on bare metal - doesn't matter.

You will have fewer hypervisors, true, but having a storage service on the hypervisors is a resource drain, so you don't actually lose as much in terms of resources. And you gain a proper storage server, less management headache, and a setup that can scale nicely if you decide to add hypervisors or buy a real SAN. Performance will also be better, and you might even end up with more available disk space, because you will not have to keep 3 replicas of every byte like Gluster/Ceph require you to do.

Isn't that an IPOD though?

dyasny

@Dashrender said in New Infrastructure to Replace Scale Cluster:

Isn't that an IPOD though?

It's a server, not an iPod. If you mean SPOF, then yes, if the entire server just dies, you lose the cluster. Obviously, the server itself can be backed up, clustered, and installed with redundant components to avoid that. It's all a matter of balancing between budget, the admin's level of paranoia, and the desire to have a reliable setup without working too hard. I really like the latter, and in the long run this approach has always served me and my customers very well.

These 3 factors are up to the OP of course. All I'm trying to say with my suggestion is that you're not saving anything by hyperconverging, because you ARE making the setup much more complex, with many more moving parts that require configuration, tuning, updates and server resources than you would have by just sticking to the KISS principle.

Dashrender

IPOD = Inverted Pyramid of Doom.
This is a term that Scott Alan Miller coined ages ago.

1 - SAN
2 - Switches
2+ - servers

The general belief is that the SAN is 'so good' it will fail less frequently than the other components. Now - in your setup it might be as good as the servers, since it's not one of those manufactured typical SANs, but it's still just a server.

The general idea around these parts is that you don't go to centralized storage until you have at least 4 hypervisor hosts; otherwise putting storage locally is often much less expensive and less risky.

And of course, we haven't even touched on HA.

Dashrender

@dyasny said in New Infrastructure to Replace Scale Cluster:

These 3 factors are up to the OP of course. All I'm trying to say with my suggestion is that you're not saving anything by hyperconverging, because you ARE making the setup much more complex, with many more moving parts that require configuration, tuning, updates and server resources than you would have by just sticking to the KISS principle.

I definitely like the simpler solution - local storage for local VMs. If you need to move VMs between hosts for maintenance - fine - make sure you have enough resources to do just that, then move them. But moving to shared storage with two compute nodes and basically wasting the compute of the third node doesn't make sense to me, not to mention putting yourself in an IPOD situation - where if you lose the disk, you lose both compute nodes.

dyasny

@Dashrender said in New Infrastructure to Replace Scale Cluster:

IPOD = Inverted Pyramid of Doom.
This is a term that Scott Alan Miller coined ages ago.

I am not a part of the Scott Alan Miller quote club, sorry

1 - SAN
2 - Switches
2+ - servers

The general belief is that the SAN is 'so good' it will fail less frequently than the other components. Now - in your setup it might be as good as the servers, since it's not one of those manufactured typical SANs, but it's still just a server.

Well, yes. And we are talking about proper brand name servers. The Dell PE series, pretty much in its entirety across all the generations I've worked with (starting with the 4th), has redundant power supplies, redundant and in some models hot-swappable RAM modules, and hot-swappable drives that can be used in a RAID, so you can tolerate outages. Not everything is replaceable on the fly and not everything is duplicated, but all the components that typically experience outages are. For everything else there is backup, and the aforementioned budget/paranoia/f&f balance to consider and maybe play with.

The general idea around these parts is that you don't go to centralized storage until you have at least 4 hypervisor hosts; otherwise putting storage locally is often much less expensive and less risky.

How is that calculated exactly?

And of course, we haven't even touched on HA.

oVirt provides HA out of the box, as long as a living host has enough resources available to start the protected VMs.
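To make the "enough resources" condition concrete, here is a minimal sketch of a capacity check. It is not oVirt's scheduler: it looks only at RAM, uses a simple first-fit placement, and the host names, sizes, and function name are hypothetical.

```python
from typing import Dict, List


def survives_any_host_failure(host_ram_gb: Dict[str, int],
                              ha_vms_gb: Dict[str, List[int]]) -> bool:
    """Check that the HA-protected VMs of any one failed host fit on the survivors.

    host_ram_gb: RAM available for VMs on each host.
    ha_vms_gb:   host name -> RAM sizes of the protected VMs running there.
    """
    for failed in host_ram_gb:
        # Free RAM left on each surviving host.
        free = {h: host_ram_gb[h] - sum(ha_vms_gb.get(h, []))
                for h in host_ram_gb if h != failed}
        # First-fit decreasing: place the largest VMs first.
        for vm in sorted(ha_vms_gb.get(failed, []), reverse=True):
            target = next((h for h in free if free[h] >= vm), None)
            if target is None:
                return False
            free[target] -= vm
    return True


hosts = {"hv1": 128, "hv2": 128}
print(survives_any_host_failure(hosts, {"hv1": [32, 16], "hv2": [64]}))
print(survives_any_host_failure(hosts, {"hv1": [64, 32], "hv2": [64]}))
```

With only two hypervisors the check is exact for RAM. With more survivors, first-fit is a heuristic and may report a failure even when some other placement would work.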

dyasny

@Dashrender said in New Infrastructure to Replace Scale Cluster:

I definitely like the simpler solution - local storage for local VMs. If you need to move VMs between hosts for maintenance - fine - make sure you have enough resources to do just that, then move them. But moving to shared storage with two compute nodes and basically wasting the compute of the third node doesn't make sense to me, not to mention putting yourself in an IPOD situation - where if you lose the disk, you lose both compute nodes.

It is even easier this way, and you can even move those VMs around, but there will be no HA there. Migrating a local VM with its storage is basically like live migrating a VM whose RAM is the size of the disk to be moved. It can take days on a gigabit link if the disk is large. So again, it's about balance between factors. If HA doesn't matter, use local disks by all means. Just be sure to back up a lot, and you'll even get the benefit of faster disk access for the local VMs. Latency is key to some apps, so it might be a good thing.
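A back-of-the-envelope sketch of how disk size and effective throughput drive that migration time. The 80% link efficiency and the disk sizes are assumptions. Real migrations also re-copy blocks that change during the move and usually share the link with other traffic, so treat these as lower bounds.

```python
def transfer_hours(size_gb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Hours needed to copy `size_gb` gigabytes over a `link_gbps` link.

    `efficiency` is the assumed fraction of line rate actually achieved.
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    bits_to_move = size_gb * 8e9                    # decimal GB -> bits
    usable_bits_per_second = link_gbps * 1e9 * efficiency
    return bits_to_move / usable_bits_per_second / 3600


# Hypothetical disk sizes on a 1 Gbps link.
for size_gb in (500, 4000, 20000):
    hours = transfer_hours(size_gb, link_gbps=1.0)
    print(f"{size_gb:>6} GB: {hours:6.1f} h ({hours / 24:.1f} days)")
```

Under these assumptions a 4 TB disk takes roughly 11 hours, and it is the tens-of-terabytes range (or a congested link) where the copy stretches into days.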

JaredBusch

@Dashrender said in New Infrastructure to Replace Scale Cluster:

This is a term that Scott Alan Miller coined ages ago.

No he didn't. Might be where you first heard it, but it is not his.

Dashrender

@JaredBusch said in New Infrastructure to Replace Scale Cluster:

@Dashrender said in New Infrastructure to Replace Scale Cluster:

This is a term that Scott Alan Miller coined ages ago.

No he didn't. Might be where you first heard it, but it is not his.

Thanks, I stand corrected.

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

I've never seen a well built SAN go completely down in over 20 years of working with them.

I have. Most SANs fail with reckless abandon. Really good ones are incredibly stable, but everything fails sometimes.

Now, in the storage industry, a "good SAN" would be defined as one that is a part of a cluster. No single box is ever that reliable; even the best ones are easily subject to the forklift problem if nothing else.

SANs can approach mainframes in reliability, but to do so is so costly that no one does it. In the real world, that kind of storage carries really high risks and/or cost compared to other options.

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

On the other hand, hyperconvergence is a resource drain, with systems like gluster and ceph eating up resources they share with the hypervisor, with neither being aware of each other, and VMs end up murdered by OOM, or just stalled due to CPU overcommitment.

That's misleading. It's much like saying that software RAID uses system resources, which is true. But the cost of creating good external resources is high, and the internal needs are low and cheap to increase. Software RAID blows hardware RAID out of the water on performance, even when shared. And you just account for that in your planning.

And there is an assumption that hypervisors and storage are not aware of each other. That can be true, but isn't necessarily. And if you use a SAN, it's guaranteed to be true.

So both of these points are selling points for HC, rather than against it.

dyasny

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

@dyasny said in New Infrastructure to Replace Scale Cluster:

I've never seen a well built SAN go completely down in over 20 years of working with them.

I have. Most SANs fail with reckless abandon. Really good ones are incredibly stable, but everything fails sometimes.

Now, in the storage industry, a "good SAN" would be defined as one that is a part of a cluster. No single box is ever that reliable; even the best ones are easily subject to the forklift problem if nothing else.

SANs can approach mainframes in reliability, but to do so is so costly that no one does it. In the real world, that kind of storage carries really high risks and/or cost compared to other options.

Let me rephrase myself. I've seen disks, controllers, PSUs, even backplanes and mobos fail in SANs. None of that ever caused an actual outage.

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

Gluster and other regular network based storage systems are going to be the bottleneck for the VM performance. So unless you don't care about everything being sluggish, you should think about getting a separate fabric for the storage comms, even if you hyperconverge.

Gluster is not known for high performance. But in the real world, SANs are normally the bottleneck on performance. Storage is always the slow point, and safe storage is even more of one.

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

@scottalanmiller said in New Infrastructure to Replace Scale Cluster:

@dyasny said in New Infrastructure to Replace Scale Cluster:

I've never seen a well built SAN go completely down in over 20 years of working with them.

I have. Most SANs fail with reckless abandon. Really good ones are incredibly stable, but everything fails sometimes.

Now, in the storage industry, a "good SAN" would be defined as one that is a part of a cluster. No single box is ever that reliable; even the best ones are easily subject to the forklift problem if nothing else.

SANs can approach mainframes in reliability, but to do so is so costly that no one does it. In the real world, that kind of storage carries really high risks and/or cost compared to other options.

Let me rephrase myself. I've seen disks, controllers, PSUs, even backplanes and mobos fail in SANs. None of that ever caused an actual outage.

Right, and I've seen all of those cause outages.

I've seen all of those fail in servers, too. And in some cases, they cause outages and in some they don't. SANs are just servers, but with a special purpose. Same risks that any similar server would have, the tech is all the same.

The problem in the real world is that SANs are so oversold that the work going into making them safe is often skipped because customers aren't as demanding as they are with servers. So the average SAN has a higher price tag for lower reliability than you normally get in the server side of the same market.

dyasny

@scottalanmiller try an ec2 i3.metal

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

The better option, IMO, is to use two hosts as hypervisors, pack the third with disks, and use it as the storage device (NFS or iSCSI). Also install the engine on it, as a VM or on bare metal - doesn't matter.

Seems a waste. You lose a lot of performance with the networking overhead, you use three hosts for the job of two, and you give up HA. That's a lot of negatives. Even if you already own the third host, doing an inverted pyramid of doom is the worst possible use of the existing resources. Better to retire the third host than to make it an anchor that will drown the other two nodes.

scottalanmiller

@Dashrender said in New Infrastructure to Replace Scale Cluster:

@dyasny said in New Infrastructure to Replace Scale Cluster:

@DustinB3403 no, in this particular setup, you have two options. The original one would be to go hyperconverged, installing both the storage and hypervisor services on all 3 hosts, and to also deploy the engine (vSphere equivalent) as a VM in the setup (that's called a self-hosted engine).

The better option, IMO, is to use two hosts as hypervisors, pack the third with disks, and use it as the storage device (NFS or iSCSI). Also install the engine on it, as a VM or on bare metal - doesn't matter.

You will have fewer hypervisors, true, but having a storage service on the hypervisors is a resource drain, so you don't actually lose as much in terms of resources. And you gain a proper storage server, less management headache, and a setup that can scale nicely if you decide to add hypervisors or buy a real SAN. Performance will also be better, and you might even end up with more available disk space, because you will not have to keep 3 replicas of every byte like Gluster/Ceph require you to do.

Isn't that an IPOD though?

Correct, a standard three node IPOD.

https://mangolassi.it/topic/8743/risk-single-server-versus-the-smallest-inverted-pyramid-design

scottalanmiller

@dyasny said in New Infrastructure to Replace Scale Cluster:

And of course, we haven't even touched on HA.

oVirt provides HA out of the box, as long as a living host has enough resources available to start the protected VMs.

By definition, HA can't be provided "out of the box." HA is something you do, not something you buy. A product may have features to make HA easier, but a product itself can't do HA.

In an IPOD, oVirt would simply automate LA (low availability). HA must be significantly higher than standard availability. The proposed IPOD design results in significantly lower than standard. (Where standard is an enterprise server with local storage and no system of this kind whatsoever.)
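A minimal sketch of the availability arithmetic behind that comparison. The availability figures are illustrative assumptions, not measurements, and the model is deliberately simple: independent failures, components in series multiply, and a redundant pair is down only when both members are down.

```python
from math import prod


def series(*availabilities: float) -> float:
    """Every component must be up, so availabilities multiply."""
    return prod(availabilities)


def redundant(availability: float, count: int) -> float:
    """Up as long as at least one of `count` identical components is up."""
    return 1 - (1 - availability) ** count


SERVER = 0.999    # assumed availability of one enterprise server
SWITCH = 0.9995   # assumed availability of one switch

# Baseline: one server with local storage.
single_server = SERVER

# 3-2-1 design: two hosts, two switches, one storage box (itself just a server).
ipod = series(redundant(SERVER, 2), redundant(SWITCH, 2), SERVER)

print(f"single server: {single_server:.6f}")
print(f"3-2-1 IPOD:    {ipod:.6f}")
```

Because the single storage box sits in series with everything else, the design as a whole can never be more available than that one box, however much redundancy sits on top of it.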

scottalanmiller

@JaredBusch said in New Infrastructure to Replace Scale Cluster:

@Dashrender said in New Infrastructure to Replace Scale Cluster:

This is a term that Scott Alan Miller coined ages ago.

No he didn't. Might be where you first heard it, but it is not his.

Actually, I did

May, 2013. It's from a short article originally on SW, but was then codified in this article on the Inverted Pyramid of Doom on SMBITJournal.

I actually did use it first (and second.) It's standard industry terminology now, but before 2013 it was only known as the 3-2-1 Architecture.