Burned by Eschewing Best Practices
-
The best way to think of a SPOF is to take any single server and unplug it, with no backup servers for its functions to migrate to.
That is a SPOF. A system or server, that runs alone, hosting whatever it might be. And when it's down, it and only it are down until the problem is repaired.
-
@DustinB3403 said:
That is a SPOF. A system or server, that runs alone, hosting whatever it might be. And when it's down, it and only it are down until the problem is repaired.
That's not my understanding of SPOF. In the context of the OP, the "system" contains various pieces of hardware (hosts, switches & SANs). If he lacks redundancy in one area of this system (for example, by only having one switch), then that piece of non-redundant hardware is a SPOF. In the pyramid analogy, it is the '1' in 3-2-1 that represents a non-redundant component and the '1' is the SPOF.
-
It still represents the same single point of failure. Any device (including a network switch, NAS, server, or network cable) that doesn't have a redundant "fail-safe" is a SPOF.
-
The trick when building a "system" of anything... is to always be searching for things that have become an SPOF. So let's start with 3 servers and 2 x SANs (Network RAID-1, redundant, automatic failover, etc, etc), and 1 x Switch all in the same building connected to the same power grid and circuits...
The first SPOF is the Network switch. How do we fix it? Add another Network switch (this is assuming that every part of this system is in the same data center / rack).
The next is the fact that they are all on the same circuit. Have the electrician separate them out.
What happens if the power blips? Need UPSes for each circuit.
What happens if there's an extended power outage? Need a good generator capable of running for hours or days as necessary. What about cooling? That goes on its own circuit and hopefully is also connected to the generator...
The list could go on and on forever. The reason so many folks warn about the complexity is that once you've built this giant system... it is extremely complex... and the more redundant you try to make it, the more complicated (and costly) it gets... The more moving parts you have, the more risk you run of missing something that is obviously another SPOF.
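The walk-through above boils down to listing every component the system depends on and flagging anything that exists only once. A minimal sketch of that check follows; the inventory, its counts and the `find_spofs` helper are made up for illustration and are not anyone's real environment.

```python
# Hypothetical inventory: component type -> number of independent units.
# The counts loosely follow the example above (3 servers, 2 SANs, 1 switch).
inventory = {
    "server": 3,
    "SAN": 2,
    "switch": 1,
    "power circuit": 1,
    "cooling unit": 1,
}

def find_spofs(inventory):
    """Return the component types that have fewer than two independent units."""
    return [name for name, count in inventory.items() if count < 2]

print(find_spofs(inventory))
```

A list like this only catches the SPOFs you thought to write down, which is exactly the risk described above: the bigger the system, the easier it is to leave a row out.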
The idea is to find the balance of increased redundancy / automatic failover / reduced down time, cost, and complexity for your organization. It might not get you to the 5 nines. But it might get you say... 3 nines (99.9%, right?) of uptime....
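For anyone who wants the "nines" as hours, here is a rough sketch of the arithmetic. It assumes a 365-day year and counts all downtime equally, planned or not; the function name is just for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years

def downtime_hours_per_year(nines):
    """Allowed downtime per year for N nines of availability (3 -> 99.9%)."""
    unavailability = 10 ** (-nines)  # 3 nines -> 0.001
    return HOURS_PER_YEAR * unavailability

for n in (3, 4, 5):
    hours = downtime_hours_per_year(n)
    print(f"{n} nines: {hours:.3f} hours/year ({hours * 60:.1f} minutes)")
```

So 3 nines allows roughly 8.76 hours of downtime a year, while 5 nines allows only about 5.26 minutes.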
This will also involve playing nicely with the bean counters. They will suffer from sticker shock when you show them the price tag for what you want (regardless of if your organization can afford it or not). Work with them and explain how you came up with the system design and how it can save money in the long run. It would also be worth bringing them in at the start to find out exactly what the cost of down time is. So you're not spending half a million dollars to prevent $20 worth of down time.
-
Yes, and it seems to me that two hosts, two switches and two SANs (2-2-2) offers a decent level of redundancy without over-complicating the system. That's where I'm not getting where the "doom" is coming from.
-
@Carnival-Boy What happens if both SANs go down? Or both switches? Don't laugh -- I've seen it happen... (and admittedly, I was the cause of it once or twice...)
-
Same as in any redundant system. Two paired disks could fail in a mirrored disk array, two paired power supplies could fail on a host. Redundancy is only about reducing risk, not eliminating it.
And in a lot of cases the risk of dual failure is higher than people would think because the two components aren't completely independent of one another. So failure on one could bring down the other. Human error would be a big cause of this. If you screw up one, there is a good chance you will screw up the other.
-
I think the reason Scott uses the IPOD analogy is to make sure people aren't going into a SAN purchase, for instance, with two eyes blind. Most folks think "Oh, we have two of them, it won't EVER go down"...
-
After rereading the thread... the IPOD / IVPD (inverted pyramid of doom)... Comes from any single thing that can completely bring down the "system".
As you mention the idea is to reduce risk of downtime, and a poorly implemented system does not actually reduce it, but increases it due to the complexity.
Several folks here would recommend, for 2 hosts, replicating VMs from HostA to HostB rather than using a SAN, because the local disk will almost always be faster than a disk on the network. You take the 10 VMs on HostA and replicate them to HostB, and then take the 10 VMs on HostB and replicate them to HostA.
Depending on your hypervisor, you have yourself a nicely recoverable system that is less complex than a 2-2-2 system because you are eliminating the complexities of the SAN. You are also saving a good chunk of money. The downside to this sort of replication is that a failure can lose any data written since the last replication (e.g., if VM1 is replicated every 5 minutes and HostA dies 4 minutes and 45 seconds after the last replication, you will likely lose those 4 minutes and 45 seconds of data, and up to 5 minutes' worth in the worst case). But your only down time will be the amount of time it takes VM1 to reboot on HostB.
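To put numbers on the replication trade-off, here is a small sketch. It assumes each replication pass completes instantly and on schedule, which real hypervisors won't quite manage, and the function name is made up for illustration.

```python
def data_loss_seconds(replication_interval_s, seconds_since_last_replication):
    """Writes made since the last completed replication are what gets lost."""
    if not 0 <= seconds_since_last_replication <= replication_interval_s:
        raise ValueError("failure must fall within one replication interval")
    return seconds_since_last_replication

interval = 5 * 60                                # replicate every 5 minutes
lost = data_loss_seconds(interval, 4 * 60 + 45)  # host dies 4 min 45 s in
print(f"lost {lost} s of writes; worst case is {interval} s")
```

Shortening the interval shrinks the worst case, at the cost of more replication traffic between the two hosts.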
-
@dafyre said:
After rereading the thread... the IPOD / IVPD (inverted pyramid of doom)... Comes from any single thing that can completely bring down the "system".
No, that would just be a SPOF. IPOD is specifically a reference to an architecture where the SPOF is the critical "base" of the system, the most important part and the one that should least be fragile, combined with the top layer(s) being broad and redundant. The design of an IPOD is to be cheap(ish) to build and confusing so that it is easy to sell to management who tend to look from the "top down" and see a big redundant system while not actually providing any safety and, in fact, putting the client at risk. People who confuse redundancy with reliability (which is nearly everyone) are easily duped by this because it comes with the "Is it redundant? Yes" answer that people look for. They forget that redundancy doesn't matter, only reliability does. And the answer to "Is it reliable" is "No, not compared to better, cheaper, easier options."
IPOD is a very specific thing.... redundancy where it doesn't matter to fool the casual observer and cost cutting where people don't look or understand and hope that "magic" will keep them safe.
-
@dafyre said:
I think the reason Scott uses the IPOD analogy is to make sure people aren't going into a SAN purchase, for instance, with two eyes blind. Most folks think "Oh, we have two of them, it won't EVER go down"...
In an IPOD, there aren't two SANs; there are other things that are redundant, but not the storage. Sometimes, to try to fool the people who know enough to point out that there is only one SAN, they will try other tricks like saying the SAN itself is "fully redundant" because it has two controllers in it - something known to be risky and pointless, which is why servers don't do it until you are pushing into full active/active.
-
@Carnival-Boy said:
Same as in any redundant system. Two paired disks could fail in a mirrored disk array, two paired power supplies could fail on a host. Redundancy is only about reducing risk, not eliminating it.
Correct, redundancy is just a tool in the hopes of achieving reliability. Redundancy can reduce risk, it can also increase it.
A good example of where redundancy routinely reduces risk a lot is RAID 1 mirroring. It takes "almost certain to have data loss" of a single drive to "almost never have data loss" of a mirrored pair.
A good example of where redundancy itself routinely increases risk is dual SAN controllers in non-active/active arrays (most SANs that SMBs can afford) where each controller can fail and kill the other controller and almost never provide any protection during a real world failure.
-
@Carnival-Boy said:
Two SANs offers a high degree of redundancy. I'm not sure where 3-2-1 fits in with that? He doesn't have a SPOF does he. He has redundant switches, redundant controllers, redundant SANs.
This is not an IPOD (aka 3-2-1.) I believe that his intended architecture is a 3-2-2. This is not a good design, but not nearly as bad as an IPOD.
The issue here is that there is redundancy, yes, but there is redundancy only through adding points of failure that are not necessary. So while there isn't a SPOF, there is unnecessary complexity as well as extra failure domains - three instead of one. So while this design, if implemented well, can be very reliable, it can never be as reliable as not having the storage layer separate nor can it compete on cost. So it isn't that it is risky as such; it is unnecessarily risky while wasting money, time and effort.
-
When would a 3-2-2 design actually make sense, since I said it wasn't horrible? When it is actually more like a 20-2-2. The point of this kind of design is for when reliability is important but nowhere near the top priority and cost savings at scale matters, which it almost always does in any large company. Once you get to enough physical servers attached to the SAN layer you start to see the ability to lower the cost of storage while making it "reliable enough" to make sense for the business at hand. So typically in an enterprise you might see hundreds or thousands of physical hosts in the "top" layer connected to many switches connected to a pair of big enterprise SANs (EMC VMAX for example.) This is never as reliable as not having the SANs at all, that just can't happen. But what it can be is quite a bit cheaper than not having the SANs and while not the best reliability, it can be pretty reliable to a point where that's not a problem.
The key is that at large scale this design can be cheap. That's why at small scale only local storage makes sense because not only is it the most reliable and the fastest, at small scale it is always the cheapest too.
-
@Carnival-Boy said:
Yeah, but cost isn't an issue as money is no object.
While I don't agree that this is ever true, even if cost is no object, SANs would never make sense since their only value is cost savings at large scale. If cost was never the goal or considered at all but only reliability and speed, that would drive us to bigger, better local storage only.
-
@DustinB3403 said:
What this means is that there are so many potential points for failure, and that in the most basic approach of the 3-2-1 the "reliability" isn't at all reliable, or is only as reliable as your weakest link, which is often the NAS (or SAN).
A better way to word and understand that is that in a dependency chain, which is what the dashes represent, you are always less reliable than your weakest link. It's not just that the SAN represents a weak point in the design, which certainly it does, you also have three failure domains. Two of them are much more reliable than the SAN, but they do present risk on their own and can fail. So your risk is not only the risk of the weakest point failing but of the combined risk of each of the layers.
Think of it this way: you have to roll a die three times (once for each domain.) If you don't get the number that you need, you lose your data. Ready.... go...
On the first roll, the SAN roll, you have to get a 4, 5 or 6. Basically you have a 50% chance of failure.
On the second roll and the third roll, you can get a 2, 3, 4, 5 or 6. You are still rolling and taking risk, but the risk of each roll is much less.
Just because a layer is very, very reliable doesn't mean there isn't risk in it and the risk of the layer is cumulative. So that is why adding layers, even when they are really reliable ones, introduces a negative value in regards to risk and why you only add them when there is a clear reason to do so (cost savings or whatever.)
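The dice example can be worked through directly. This sketch assumes the three rolls are independent; the survival odds come straight from the rolls described above (3 winning faces out of 6 for the SAN roll, 5 out of 6 for the other two).

```python
from fractions import Fraction

# Chance of surviving each roll in the dice example above.
survival = [Fraction(3, 6), Fraction(5, 6), Fraction(5, 6)]

def chain_survival(layers):
    """Probability that every layer in a dependency chain survives."""
    total = Fraction(1)
    for p in layers:
        total *= p
    return total

overall = chain_survival(survival)
print(f"overall survival: {overall} (about {float(overall):.1%})")
print(f"weakest single layer: {min(survival)}")
```

The chain as a whole survives 25 times in 72, about 35%, which is worse than the 50% of the weakest layer on its own. That gap is the cumulative risk of the extra layers.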
-
@Carnival-Boy said:
But isn't this 2-2-2 and not 3-2-1? I'm still not getting it.....
I'm playing catch up here and see that you nailed this. Yes it is a 2-2-2 or "column" design, far better than an IPOD. But it is not nearly as good as a 2. He has six pieces of gear to fail in three failure domains. What he has is far better (in terms of reliability) than a single server if done well, but not nearly as good as two servers without the external storage. The external storage and the need for external networking between the storage and servers triple the failure domains without adding anything of value - in fact beyond the risk it only introduces cost, complexity and latency. No upsides, lots of downsides.
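A back-of-the-envelope comparison of a 2-2-2 against a plain 2 might look like the sketch below. The 99% per-device availability is a made-up figure, every device is assumed equally reliable, and failures are assumed independent (which, as noted earlier in the thread, they often are not), so treat the output as illustrative only.

```python
def pair_availability(a):
    """Availability of two redundant units, assuming independent failures."""
    return 1 - (1 - a) ** 2

def series_availability(layers):
    """Availability of a chain where every layer must be up."""
    result = 1.0
    for a in layers:
        result *= a
    return result

DEVICE = 0.99  # hypothetical availability of any single device

two = pair_availability(DEVICE)                        # two hosts, local storage
two_two_two = series_availability([pair_availability(DEVICE)] * 3)

print(f"2:     {two:.6f}")
print(f"2-2-2: {two_two_two:.6f}")
print(f"unavailability ratio: {(1 - two_two_two) / (1 - two):.2f}")
```

Under these assumptions the 2-2-2 is still very available, but its expected unavailability is roughly three times that of the plain pair, one share for each failure domain.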
-
@Carnival-Boy said:
Yes, and it seems to me that two hosts, two switches and two SANs (2-2-2) offers a decent level of redundancy without over-complicating the system. That's where I'm not getting where the "doom" is coming from.
Any unnecessary complication is over-complicated. You don't design a system to be "less than ideal" for no reason. The idea that the system isn't "terrible" is correct as long as you don't take into consideration the alternatives. He could have a system that is easier, safer, faster, simpler and cheaper. Why sacrifice all of those things just because something worse in all those ways is still "good enough?" You don't. You go for the clear win. His design, while "good enough" for nearly any scenario, only looks that way if you are dealing with raw numbers rather than the relative ones provided by other approaches.
To think of it another way - if you go to buy a car and you are going to buy a Ford Focus, would you be fine paying $80K for it? If you had no idea what cars cost and never compared other options, sure, a car is a miracle of engineering and over the life of a car you probably get more than $80K of value out of it. So you would spend that money. BUT what if you knew that you could buy that car for $20K? Would you still be happy and recommend that someone spend $80K on it when you know that the market value is only $20K and you can go anywhere and buy it for that?
In one case the raw value to you of "a car" might be $80K. But that doesn't mean that you should pay that much when you have the option of getting the car that you need for far less. The $80K would be good enough if better options were not readily available. But given that they are, it's not a good decision.
In the case of IT, imagine that IT is like a car buying consultant. Would you be happy if your car buying consultant sold you an $80K Ford Focus because he knew you could afford it and that it was worth that much to you to have a car? Of course not, you'd say that he wasn't doing his job and looking for the best value. That's what is happening here. The IT guy's job is to not just know how to spend money but how to get good IT value without wasting money. But in this situation, the IT guy is delivering a system worth less but spending tons more on it.
-
Another one for the pile, not nearly as bad as some others, but in this case the hypervisor was installed on a single spinning rust drive, which, when it failed, killed the entire hypervisor host.
In this case though, they have a XenPool so recovery should be simple enough: install a new drive, reconfigure the host, and lastly rejoin it to the pool.
But had they configured the host to boot from a USB drive, they could simply install the backup USB drive and be up and running in a matter of moments.
-
To boot, on the above system the designer left the battery backup off of their iSCSI storage unit, which houses all of their VMs.
"...........ugh..."