Burned by Eschewing Best Practices
-
@Carnival-Boy said:
I'm not disagreeing, but if the OP says money is no object then you should treat that as fact.
I don't agree. Knowing someone is wrong, confused, or doesn't understand something is exactly when they need help the most, not the least. A huge part of what we do in IT is recognizing when people don't know what they need to know and helping them. In a case like this, where we know they have to be wrong and don't understand what they are doing, should we really help them hurt themselves?
I totally get that this goes against my "always give people the benefit of the doubt" theory about never hurting the innocent to protect the guilty, but money is never no object. It's simply not true, and it means someone desperately needs help and doesn't understand what they don't know.
-
@Carnival-Boy said:
Or is forced to spend a certain budget regardless of whether he needs it or not. Who knows, that's not the point.
That would actually make money the ONLY object. Budget would be the whole concern, not just part of it.
-
Well, ok, but the OP isn't actually on ML, so it's a moot point. What I'm really interested in is what the problem is with his solution (ignoring the financial cost) and why it is one of your inverted pyramid thingies. I'm not arguing, I just don't understand and want to learn.
-
An IVPD looks stable and reliable when you view it from the top: you have a bunch of equipment that supposedly fails over between the devices.
But what isn't obvious is that if you look at it from the side you have individual layers of equipment, each dependent on everything above or below itself.
So in the simplest example, 3-2-1, you have one NAS (or SAN), two switches, and three servers.
The name refers to three (this is a soft point, it is often two or more) redundant virtualization host servers connected to two (or potentially more) redundant switches connected to a single storage device, normally a SAN (but DAS or NAS are valid here as well.) It’s an inverted pyramid because the part that matters, the virtualization hosts, depend completely on the network which, in turn, depends completely on the single SAN or alternative storage device. So everything rests on a single point of failure device and all of the protection and redundancy is built more and more on top of that fragile foundation. Unlike a proper pyramid with a wide, stable base and a point on top, this is built with all of the weakness at the bottom. (Often the ‘unicorn farts’ marketing model of “SANs are magic and can’t fail because of dual controllers” comes out here as people try to explain how this isn’t a single point of failure, but it is a single point of failure in every sense.)
What this means is that there are many potential points of failure, and that in the most basic 3-2-1 approach the "reliability" isn't reliable at all. It is only as reliable as your weakest link, which is often the NAS (or SAN).
If any part of that chain breaks, the whole system can, and likely will, come crashing down. Here's a really good explanation from the one and only SAM:
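The "weakest link" point above can be made concrete with some simple probability. The sketch below models a 3-2-1 IPOD: redundant layers multiply out their failure chances, but everything still sits in series behind the single SAN, so system availability can never exceed the SAN's. All the availability figures here are invented for illustration, not vendor numbers.

```python
# Rough availability model of a 3-2-1 inverted pyramid.
# All availability figures below are illustrative assumptions.

def parallel(*avail):
    """Availability of redundant components (any one being up suffices)."""
    unavail = 1.0
    for a in avail:
        unavail *= (1.0 - a)
    return 1.0 - unavail

def series(*avail):
    """Availability of a dependency chain (every layer must be up)."""
    total = 1.0
    for a in avail:
        total *= a
    return total

host = 0.99      # assumed availability of one virtualization host
switch = 0.99    # assumed availability of one switch
san = 0.999      # assumed availability of the single SAN

hosts = parallel(host, host, host)     # three redundant hosts
switches = parallel(switch, switch)    # two redundant switches
system = series(hosts, switches, san)  # everything still depends on the SAN

print(f"Host layer:   {hosts:.6f}")
print(f"Switch layer: {switches:.6f}")
print(f"Whole system: {system:.6f}")   # always capped below the SAN's 0.999
```

However much redundancy you pile on top, the whole system's availability stays strictly below that of the single storage device at the bottom.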
-
In addition, and outside of what was brought up in the SW topic, there were likely many other Best Practices that were not followed by the OP on SW, which led to him getting burned regardless of what hypervisor his employer uses.
The reason I say this is that the SW OP has stated he was already burned by Citrix Support, which seems very odd, as support didn't design the system to fail; they are trying to recover a failed system.
In summary, the SW OP has a system that was improperly set up (likely by Eschewing Best Practices) for the benefit of quick deployment, without understanding how and why he got burned.
It has nothing to do with Citrix, unless Citrix saw the state of their system and how things were configured and said "Nope Nope Nope, we can't help you as everything you've setup is completely ignoring Best Practice Recommendations in its configuration, it has to be rebuilt." And "We won't support the system in this configuration."
Which is probably how the conversation went.
-
But isn't this 2-2-2 and not 3-2-1? I'm still not getting it.....
"The name refers to three (this is a soft point, it is often two or more) redundant virtualization host servers connected to two (or potentially more) redundant switches connected to a single storage device"
There is no single storage device here. Isn't it a "Tower of Redundancy" rather than a "Pyramid of Doom"? An expensive tower, but a tower. Or maybe a folly.
-
And what's the difference between "Inverted Pyramid of Doom" and the traditional term "Single Point of Failure (SPOF)", as in "a single SAN is a SPOF and therefore a bad solution. You need at least two for redundancy"?
-
This is still an IVPD, because the servers are dependent on the NAS(s). It's an improved IVPD (if such a thing could exist), but there are still many points that can fail.
That makes it an overly complicated solution which, by design, reduces the reliability of the system as a whole, including its recoverability and stability.
-
A Single Point of Failure by itself won't bring the entire organization down.
Only that point, and what it hosts, is unavailable until it's fixed.
-
The best way to think of a SPOF is to take any single server and unplug it, with no other backup servers for its functions to migrate to.
That is a SPOF: a system or server that runs alone, hosting whatever it might be. And when it's down, it and only it is down until the problem is repaired.
-
@DustinB3403 said:
That is a SPOF: a system or server that runs alone, hosting whatever it might be. And when it's down, it and only it is down until the problem is repaired.
That's not my understanding of SPOF. In the context of the OP, the "system" contains various pieces of hardware (hosts, switches & SANs). If he lacks redundancy in one area of this system (for example, by only having one switch), then that piece of non-redundant hardware is a SPOF. In the pyramid analogy, it is the '1' in 3-2-1 that represents a non-redundant component and the '1' is the SPOF.
-
It still represents the same single point of failure. Any device (including a network switch, NAS, server, or network cable) that doesn't have a redundant "fail-safe" is a SPOF.
-
The trick when building a "system" of anything... is to always be searching for things that have become an SPOF. So let's start with 3 servers and 2 x SANs (Network RAID-1, redundant, automatic failover, etc, etc), and 1 x Switch all in the same building connected to the same power grid and circuits...
The first SPOF is the Network switch. How do we fix it? Add another Network switch (this is assuming that every part of this system is in the same data center / rack).
The next is the fact that they are all on the same circuit. Have the electrician separate them out.
What happens if the power blips? You need UPSes for each circuit.
What happens if there's an extended power outage? You need a good generator capable of running for hours or days as necessary. What about cooling? That goes on its own circuit and hopefully is also connected to the generator...
The list could go on forever. The reason so many folks warn about the complexity is that once you've built this giant system... it is extremely complex... and the more redundant you try to make it, the more complicated (and costly) it gets. The more moving parts you have, the greater the risk of missing something that is obviously another SPOF.
The idea is to find the balance of increased redundancy / automatic failover / reduced downtime, cost, and complexity for your organization. It might not get you to the five nines, but it might get you, say, three nines (99.9%) of uptime...
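For anyone unsure what "three nines" actually buys you, the conversion from an availability percentage to allowed downtime is just arithmetic, as this small sketch shows:

```python
# Convert an availability target ("nines") into allowed downtime
# per year, useful when weighing redundancy cost against benefit.

SECONDS_PER_YEAR = 365 * 24 * 3600  # ignoring leap years

def downtime_per_year(availability):
    """Seconds of downtime per year at the given availability."""
    return SECONDS_PER_YEAR * (1.0 - availability)

for a in (0.99, 0.999, 0.9999, 0.99999):
    secs = downtime_per_year(a)
    print(f"{a:.3%} uptime -> {secs / 3600:.2f} hours/year down")
```

Three nines works out to roughly 8.76 hours of downtime a year; each extra nine cuts that by a factor of ten, and the cost of getting there tends to climb much faster than that.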
This will also involve playing nicely with the bean counters. They will suffer sticker shock when you show them the price tag for what you want (regardless of whether your organization can afford it). Work with them and explain how you came up with the system design and how it can save money in the long run. It would also be worth bringing them in at the start to find out exactly what the cost of downtime is, so you're not spending half a million dollars to prevent $20 worth of downtime.
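That downtime-cost comparison can be sketched as a quick expected-value check. Every dollar figure and uptime number below is a made-up example, not a real quote:

```python
# Back-of-the-envelope check for the bean counters: is the redundancy
# spend justified by the downtime it prevents? Figures are invented.

def downtime_cost_per_year(hours_down, cost_per_hour):
    """Annual cost of downtime at a given hourly business cost."""
    return hours_down * cost_per_hour

# ~99.9% uptime is about 8.76 hours down per year; ~99.99% is about 0.88
current = downtime_cost_per_year(8.76, 500)   # assumed $500/hour of outage
improved = downtime_cost_per_year(0.88, 500)
savings = current - improved

print(f"Annual downtime cost now:  ${current:,.2f}")
print(f"After the upgrade:         ${improved:,.2f}")
print(f"Savings per year:          ${savings:,.2f}")
# If the upgrade costs $500,000, payback would take over a century --
# exactly the "half a million to prevent $20" scenario above.
```

Running the numbers this way, with the accountants in the room, is what keeps the design proportionate to what downtime actually costs the business.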
-
Yes, and it seems to me that two hosts, two switches and two SANs (2-2-2) offers a decent level of redundancy without over-complicating the system. That's why I'm not seeing where the "doom" is coming from.
-
@Carnival-Boy What happens if both SANs go down? Or both switches? Don't laugh -- I've seen it happen... (and admittedly, I was the cause of it once or twice...)
-
Same as in any redundant system. Two paired disks could fail in a mirrored disk array, two paired power supplies could fail on a host. Redundancy is only about reducing risk, not eliminating it.
And in a lot of cases the risk of dual failure is higher than people would think, because the two components aren't completely independent of one another, so failure of one could bring down the other. Human error would be a big cause of this: if you screw up one, there is a good chance you will screw up the other.
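The point about correlated failure is easy to underestimate, so here's a small Monte Carlo sketch. The failure and cascade probabilities are invented purely for illustration; the takeaway is how much a modest chance of one failure knocking out its twin inflates the dual-failure rate compared to the naive independent p-squared estimate.

```python
# Sketch of how correlated failures raise dual-failure risk.
# All probabilities here are invented for illustration only.
import random

random.seed(42)

p_fail = 0.01      # assumed chance either unit fails in some window
p_cascade = 0.25   # assumed chance a failure takes out its partner
                   # (shared circuit, shared config, same human error)
trials = 1_000_000

independent_both = 0
correlated_both = 0
for _ in range(trials):
    a = random.random() < p_fail
    b = random.random() < p_fail
    if a and b:                      # purely independent double failure
        independent_both += 1
    # correlated model: a failure may also cascade to the partner
    if (a and (b or random.random() < p_cascade)) or \
       (b and (a or random.random() < p_cascade)):
        correlated_both += 1

print(f"Independent dual failure: {independent_both / trials:.6f}")  # ~ p^2
print(f"Correlated dual failure:  {correlated_both / trials:.6f}")   # far higher
```

With these toy numbers the correlated rate comes out dozens of times higher than the independent one, which is why "we have two of everything" is a much weaker guarantee than it sounds.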
-
I think the reason Scott uses the IPOD analogy is to make sure people aren't going into a SAN purchase, for instance, with two eyes blind. Most folks think "Oh, we have two of them, it won't EVER go down"...
-
After rereading the thread... the IPOD / IVPD (inverted pyramid of doom)... Comes from any single thing that can completely bring down the "system".
As you mention the idea is to reduce risk of downtime, and a poorly implemented system does not actually reduce it, but increases it due to the complexity.
Several folks here would recommend, for two hosts, using replication of VMs from HostA to HostB instead of a SAN, because local disk will almost always be faster than disk on the network. You take the 10 VMs on HostA and replicate them to HostB, and the 10 VMs on HostB and replicate them to HostA.
Depending on your hypervisor, you have yourself a nicely recoverable system that is less complex than a 2-2-2 system because you are eliminating the complexities of the SAN. You are also saving a good chunk of money. The downside to this sort of replication is that failures can cause data loss between replications (e.g. if VM1 is replicated every 5 minutes and Host A dies 4 minutes and 45 seconds after the last replication, you lose nearly 5 minutes' worth of data). But your only downtime will be the time it takes VM1 to boot on HostB.
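The trade-off described above is what backup folks call the recovery point objective (RPO): with periodic replication, the worst case is losing one full interval of changes. A trivial sketch of the arithmetic, with the same illustrative 5-minute interval:

```python
# RPO arithmetic for host-to-host VM replication, as described above.
# The 5-minute interval is just the example from the post.

def worst_case_loss_minutes(interval_minutes):
    """With periodic replication, data loss is capped at one interval."""
    return interval_minutes

def loss_at_failure(interval_minutes, minutes_since_last_sync):
    """Actual data lost if the source host dies mid-interval."""
    return min(minutes_since_last_sync, interval_minutes)

# VM replicated every 5 minutes; host dies 4 min 45 s after the last sync
print(loss_at_failure(5, 4.75))    # 4.75 minutes of changes lost
print(worst_case_loss_minutes(5))  # 5-minute worst case
```

Shrinking the interval shrinks the worst-case loss, at the cost of more replication traffic, so the interval is another knob to tune against what the data is worth.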
-
@dafyre said:
After rereading the thread... the IPOD / IVPD (inverted pyramid of doom)... Comes from any single thing that can completely bring down the "system".
No, that would just be a SPOF. IPOD specifically refers to an architecture where the SPOF is the critical "base" of the system, the one part you can least afford to leave fragile, combined with the top layer(s) being broad and redundant. The design of an IPOD is to be cheap(ish) to build and confusing, so that it is easy to sell to management, who tend to look from the "top down" and see a big redundant system, while not actually providing any safety and, in fact, putting the client at risk. People who confuse redundancy with reliability (which is nearly everyone) are easily duped by this because it comes with the "Is it redundant? Yes" answer that people look for. They forget that redundancy doesn't matter, only reliability does. And the answer to "Is it reliable?" is "No, not compared to better, cheaper, easier options."
An IPOD is a very specific thing: redundancy where it doesn't matter, to fool the casual observer, and cost cutting where people don't look or understand and hope that "magic" will keep them safe.
-
@dafyre said:
I think the reason Scott uses the IPOD analogy is to make sure people aren't going into a SAN purchase, for instance, with two eyes blind. Most folks think "Oh, we have two of them, it won't EVER go down"...
In an IPOD, there aren't two SANs; other things are redundant, but not the storage. Sometimes, to try to fool the people who know enough to point out that there is only one SAN, they will try other tricks, like saying the SAN itself is "fully redundant" because it has two controllers in it. That is something known to be risky and pointless, which is why servers don't do it until you are pushing into full active/active.