Configuration for Open Source Operating Systems with the SAM-SD Approach
-
@GotiTServicesInc said:
I know for current leads in our company they're requesting a dual controller box, but being that this box is operating by itself I'm assuming with the additional controller they're buying a bit more redundancy for that.
You don't do dual controllers for protection, not in the real world. RAID controllers do very bad things when you pair them up; that's why enterprise servers don't ship that way, ever. Not even $100K servers are like that. And that's why dual controllers in SMB range SANs are bad: they actually cause outages rather than protecting against them.
One of the most important reads here is: Understanding the Relationship of Reliability and Redundancy
Reliability is the goal, redundancy is a tool. In this case, your redundancy would not support your goal so isn't a viable option.
-
In your OP you said dual SAS controllers. That is how this is handled for high end enterprise servers, $50K and up. The high end doesn't use RAID controllers at all, only SAS. A SAS controller with software RAID can provide the redundancy that the RAID controllers on the market cannot do well.
-
I assumed SAS has built-in RAID, which I realize now was a bad assumption. Dual controllers would just be used to increase the number of drives you could stick in a box then, which makes more sense.
-
@GotiTServicesInc said:
so really a HA SAN for an enterprise wouldn't really be more than 2 ish SANs? And more than that you would go for more of a RAIN setup?
Correct, enterprise SANs have traditionally always been "scale up" or "vertical scaling" devices. Your size is determined by how big a single SAN can go. HA is provided by a combination of mainframe class design and features and mirroring to a second SAN. That's all we've ever had traditionally.
Moving to scale out systems is very new and very niche still. Using scale out with SAN is still relatively rare and problematic. The first vendors are just starting to get their footing on this and there are generally performance issues.
-
@scottalanmiller said:
In your OP you said dual SAS controllers. That is how this is handled for high end enterprise servers, $50K and up. The high end doesn't use RAID controllers at all, only SAS. A SAS controller with software RAID can provide the redundancy that the RAID controllers on the market cannot do well.
So with the OP, would the question be more like, which RAID level should I use over all these drives in software, vs one controller RAIDed against the other controller?
And would a SAM-SD really look at skipping hardware RAID for software?
-
I did skim that but wasn't sure where the risk vs. reward scale crossed in regards to RAID
-
@GotiTServicesInc said:
I assumed SAS has built-in RAID, which I realize now was a bad assumption. Dual controllers would just be used to increase the number of drives you could stick in a box then, which makes more sense.
SAS and SATA are the protocols that storage uses to communicate. RAID controllers talk SAS and/or SATA, but most SAS and SATA controllers do this without having any hardware RAID.
You don't use multiple controllers for more drives. There isn't any normal server on the market that goes beyond what a single controller can handle. A single good controller will do hundreds of drives.
-
@Dashrender well, if it's not running other apps on the OS why save the CPU with a RAID card?
edit: I just realized you were tl;dr'ing, sorry - this post was for OP
-
@Dashrender said:
@scottalanmiller said:
In your OP you said dual SAS controllers. That is how this is handled for high end enterprise servers, $50K and up. The high end doesn't use RAID controllers at all, only SAS. A SAS controller with software RAID can provide the redundancy that the RAID controllers on the market cannot do well.
So with the OP, would the question be more like, which RAID level should I use over all these drives in software, vs one controller RAIDed against the other controller?
And would a SAM-SD really look at skipping hardware RAID for software?
That may have been more of what I was looking for but didn't know how to ask the question
-
So why do they make boxes with dual controllers in them then? I'm referring to your favorite, Nexsan @scottalanmiller
-
@GotiTServicesInc said:
So why do they make boxes with dual controllers in them then? I'm referring to your favorite, Nexsan @scottalanmiller
This seems like an obvious answer, which I'm going to say is money: because they can sell them. But I'm sure Scott will correct me if I'm wrong on that.
-
So a brief Introduction to Hardware and Software RAID.
Pretty much you use hardware RAID when...
- you run Windows, Hyper-V or ESXi on the bare metal.
- you want blind drive swaps in a datacenter.
- you lack the operating system and/or software RAID experience or support to use software RAID.
- you just want a lot of convenience.
-
Reasons 1 and 2 (OS/HV choice and blind drive swaps) represent nearly every case as to why hardware RAID is chosen and legitimately make hardware RAID a nearly ubiquitous choice in the SMB.
-
@scottalanmiller said:
So a brief Introduction to Hardware and Software RAID.
Pretty much you use hardware RAID when...
- you run Windows, Hyper-V or ESXi on the bare metal.
- you want blind drive swaps in a datacenter.
- you lack the operating system and/or software RAID experience or support to use software RAID.
- you just want a lot of convenience.
For the relatively small cost of a RAID controller, there are a lot of reasons to use one.
-
@Dashrender said:
For the relatively small cost of a RAID controller, there are a lot of reasons to use one.
And most people agree and buy them. It's generally a no brainer in the SMB. For me, it's reason 2. I want everyone to have blind swap. Can't overstate how valuable that is in the real world.
-
Are you thinking that you will do software RAID between cluster nodes in this example above? I'm unclear if we are looking at up to three layers of RAID (hardware, then software, then network)? That's getting extreme.
-
I wasn't sure how far you would really need to take it for mission critical data. I was assuming that at some point you only want the data mirrored once. Whether that was mirrored at the drive level or at the system level I wasn't sure. I'm assuming if you have huge storage arrays you'd want to have two SANs with JBOD arrays, mirrored across each other? Or would you want to have a RAID 6 setup in each SAN and then mirror the SANs to each other? That way (like we talked about earlier) you can have a few drives go bad and not have to rebuild the entire array. In my mind the second way (with RAID 6) seems to make the most sense?
-
@GotiTServicesInc said:
I wasn't sure how far you would really need to take it for mission critical data. I was assuming that at some point you only want the data mirrored once. Whether that was mirrored at the drive level or at the system level I wasn't sure. I'm assuming if you have huge storage arrays you'd want to have two SANs with JBOD arrays, mirrored across each other? Or would you want to have a RAID 6 setup in each SAN and then mirror the SANs to each other? That way (like we talked about earlier) you can have a few drives go bad and not have to rebuild the entire array. In my mind the second way (with RAID 6) seems to make the most sense?
No large, mission critical storage system is using RAID 0 per node. If you think about drive failure risks, and that the time to rebuild a node over the network is long, the risk would be insane. Say you have two 100TB SANs. If you lost a single drive on one node, you've lost 100TB of data and all redundancy. Poof, gone, one drive having failed.
Now you fail over to your second node. You now have 100TB on a SAN on RAID 0, zero redundancy! Assume you could replace the failed SAN in four hours and start a RAID 1 rebuild across the network, and assume that you have 8Gb/s Fibre Channel (which is 1GB/s). Copying the data back would take 27.7 hours, assuming zero overhead from the disks, zero overhead from the network protocol, zero delays and absolutely no one accessing the SAN in any way during that 27.7 hour period. Any reads will slow down the network; any writes will slow down the network and require additional data to be transferred.
So a theoretical best case scenario restore is 32 hours (four hours to replace the SAN plus nearly 28 hours to copy), during which you have 100TB sitting on RAID 0 without anything to protect you from the slightest drive blip. And suddenly you would be vulnerable to UREs too, which would be expected to happen rather often in an array that large.
More realistically you would expect to be restoring for at least three days if you had dedicated 8Gb/s FC and more like a week and a half if you were only on dedicated GigE iSCSI.
That's a long time to pray a RAID 0 doesn't have an issue.
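The best case arithmetic here can be sketched in a few lines of Python. This is a back-of-the-envelope calculation, not a model of a real rebuild: the function name `rebuild_hours` and its parameters are made up for illustration, it uses decimal units (1TB = 1000GB), and it assumes the link runs flat out with zero overhead and nobody touching the SAN.

```python
def rebuild_hours(capacity_tb: float, link_gb_per_s: float,
                  replace_hours: float = 0.0) -> float:
    """Best-case hours to re-mirror `capacity_tb` terabytes over a link
    that moves `link_gb_per_s` gigabytes per second.

    Assumes zero disk, protocol and contention overhead (hypothetical
    best case) and decimal units: 1 TB = 1000 GB.
    """
    copy_seconds = capacity_tb * 1000 / link_gb_per_s
    return replace_hours + copy_seconds / 3600


# 100TB over a link treated as 1GB/s (the 8Gb/s FC figure from the post).
copy_only = rebuild_hours(100, 1.0)

# Same copy, plus four hours to get the replacement SAN in place.
with_replacement = rebuild_hours(100, 1.0, replace_hours=4)

# Same copy over GigE, treated as 1Gb/s = 0.125GB/s at line rate.
gige_copy = rebuild_hours(100, 0.125)

print(f"FC copy only:        {copy_only:.1f} hours")
print(f"FC with replacement: {with_replacement:.1f} hours")
print(f"GigE copy only:      {gige_copy / 24:.1f} days")
```

Even the line rate GigE figure comes out at more than a week, before any of the real world overhead that stretches it further.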
-
And that rebuild scenario would come up pretty often because drives fail often when you have big numbers of them. This isn't a theoretical edge case, you would expect this to be happening a couple or a few times a year on a really large SAN cluster like this. That's 200TB of drives after redundancy. Even with big 4TB drives, that's 50 of them. They are going to fail from time to time.
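To put rough numbers on that, here is a small sketch. The annualized failure rates (AFR) in it are assumptions picked purely for illustration, not measured figures for any drive model, and it treats drive failures as independent, which real arrays do not guarantee. The function names are hypothetical.

```python
def expected_failures_per_year(drives: int, afr: float) -> float:
    """Average number of drive failures per year for `drives` drives,
    each with annualized failure rate `afr` (0.04 means 4%)."""
    return drives * afr


def chance_of_any_failure(drives: int, afr: float) -> float:
    """Probability that at least one drive fails within a year,
    assuming failures are independent (a simplifying assumption)."""
    return 1 - (1 - afr) ** drives


DRIVES = 50  # 200TB of 4TB drives, as in the example above

# Assumed AFR values, for illustration only.
for afr in (0.02, 0.04, 0.06):
    expected = expected_failures_per_year(DRIVES, afr)
    chance = chance_of_any_failure(DRIVES, afr)
    print(f"AFR {afr:.0%}: about {expected:.1f} failures/year, "
          f"{chance:.0%} chance of at least one")
```

Under those assumed rates, 50 drives lands in the range of one to three failures a year, and with RAID 0 per node every one of them triggers the full rebuild described above.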
-
So for a large setup like that (50 drives), would you want to do a RAID 6 per 10 drives and software RAID 0 them together, to allow for a quicker rebuild time with more drives being able to fail simultaneously?
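For reference, the layout being asked about (RAID 6 groups striped together with RAID 0, commonly called RAID 60) works out as follows for 50 drives. This is only a capacity and fault tolerance sketch: the function name `raid60_layout` is hypothetical, the 4TB drive size is taken from the example earlier in the thread, and it says nothing about rebuild time or performance.

```python
def raid60_layout(total_drives: int, group_size: int, drive_tb: float) -> dict:
    """Summarize a RAID 60 layout: RAID 6 groups of `group_size` drives,
    striped together with RAID 0."""
    if group_size < 4:
        raise ValueError("RAID 6 needs at least 4 drives per group")
    if total_drives % group_size != 0:
        raise ValueError("drives must divide evenly into groups")

    groups = total_drives // group_size
    # Each RAID 6 group gives up two drives' worth of space to parity.
    data_drives = groups * (group_size - 2)
    return {
        "groups": groups,
        "usable_tb": data_drives * drive_tb,
        # Any two failures are survivable no matter where they land.
        "failures_always_survived": 2,
        # Best case: exactly two failures in every group.
        "failures_survived_at_best": 2 * groups,
    }


layout = raid60_layout(total_drives=50, group_size=10, drive_tb=4)
for key, value in layout.items():
    print(f"{key}: {value}")
```

The catch is in the last two numbers: the array can ride out up to two failures per group, but a third failure inside any single group takes out the whole stripe.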