Application clustering VS RAID with modern SSD
-
I have a POF setup of an enterprise server (IBM x3550m4) running with just an Intel P3600 1.6 TB in PCIe card form factor, hosting an ERP with an MS SQL DB and a file server with ~400 GB of regular office files. I have run several benchmarks over the last few months and everything is running just fine.
I will soon put this stuff in production, and I have a spare x3550m4 (new, with 1.2 TB of 10k spindles). Within my budget (~1500 euro, less is better), I’m thinking about two architectural patterns to get better reliability than the POF single node:
- Buy another enterprise SSD (maybe the Samsung 1725a, also 1.6 TB) for the other node, set up SQL replication for the DB and a file server sync daemon (DRFS, Syncthing, etc.), and fail over the DNS if the first node fails (in any way). We can tolerate some hours of downtime;
- Buy a couple of smaller SAS SSDs and put them in RAID 1, use those as the primary storage, and use the P3600 or the spindles for the replicas.
A couple of considerations about that:
- I think that today’s enterprise-class PCIe SSDs (like the ones I mentioned), in the first 5 years from deployment and with the right overprovisioning, are almost as reliable as a RAID controller, because they have a full solid-state storage controller, no moving parts, and a very reassuring declared MTBF;
- Those services can tolerate even a day of downtime every 3-5 years without major impact on revenue;
- I don’t see the point of RAID, or of node-level reliability in general, if I can rely on better application clustering. This idea was inspired by hyperscaler/Open Compute machines, which run on a single PSU etc. because their reliability is achieved at a higher level.
What do you think about it? Any hints are welcome!
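For reference, the tolerance stated above (a day of downtime every 3-5 years) can be turned into an availability figure. This is just arithmetic on the numbers in the post; the function name is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # ignoring leap years


def availability(downtime_hours: float, period_years: float) -> float:
    """Fraction of time the service is up, given total downtime over a period."""
    total_hours = period_years * HOURS_PER_YEAR
    return 1.0 - downtime_hours / total_hours


# One full day (24 h) of downtime every 3 or every 5 years, as in the post.
for years in (3, 5):
    pct = availability(24, years) * 100
    print(f"1 day down every {years} years -> {pct:.3f}% available")
```

Both cases land at roughly "three nines", which is a useful yardstick when deciding how much redundancy the budget really has to buy.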
-
@scottalanmiller I’m sure you’ve written something about application clustering somewhere.
-
@francesco-provino said in Application clustering VS RAID with modern SSD:
@scottalanmiller I’m sure you’ve written something about application clustering somewhere.
Something, but not anything covering this specifically.
-
There are multiple layers that can address needs and remove a need for RAID. Essentially, all of them are things that are "more than" RAID.
-
In many cases, such as a two node application cluster, you can certainly do without RAID, but you typically do so by effectively replicating RAID in a different way.
So let's take a MySQL two node cluster as an example. Doing MySQL clustering is fine, but for all intents and purposes it requires a combination of manual mirroring and the application clustering acting like Network RAID 1. It's not actually RAID, but it is acting like it.
The reason you normally use RAID, local RAID that is, is to avoid node rebuilds. Whether it's Network RAID, application mirroring, or whatever, a node rebuild has an impact. If you have no local RAID, those rebuilds become incredibly frequent rather than rare to nonexistent.
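To put a rough number on "frequent rather than rare", here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the 2% annual failure rate, the 3-day replacement window, and the simplification that drive failures are independent (which real-world failures often are not).

```python
def p_rebuild_no_raid(afr: float) -> float:
    """Single drive, no local RAID: any drive failure in a year forces a node rebuild."""
    return afr


def p_rebuild_raid1(afr: float, replace_days: float) -> float:
    """RAID 1 pair: a node rebuild is needed only if the surviving drive also
    fails before the first failed drive has been replaced and resynced."""
    p_first = 1 - (1 - afr) ** 2              # either of the two drives fails in the year
    p_second_in_window = afr * replace_days / 365  # survivor fails during the exposure window
    return p_first * p_second_in_window


AFR = 0.02         # hypothetical annual failure rate per drive
REPLACE_DAYS = 3   # hypothetical time to swap and resync a failed drive

single = p_rebuild_no_raid(AFR)
mirrored = p_rebuild_raid1(AFR, REPLACE_DAYS)
print(f"no RAID : {single:.4%} chance of a node rebuild per year")
print(f"RAID 1  : {mirrored:.6%} chance of a node rebuild per year")
print(f"ratio   : about {single / mirrored:,.0f}x")
```

The exact figures are not meaningful; the point is that the mirrored case multiplies two small probabilities, so node rebuilds drop by orders of magnitude.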
-
For some things, like stateless app servers, rebuilds might not affect the app at all, so skipping RAID might be totally feasible with little to no impact. But for something like a database, where other nodes have to recreate data over the network, the impact might be pretty negative.
-
Thanks @scottalanmiller. I think I’ll start with DRBD over two single PCIe NVMe cards (one in each node) synced through an InfiniBand link (InfiniBand is cheap today!), and I will slowly move every capable workload to application clustering.
-
@francesco-provino said in Application clustering VS RAID with modern SSD:
Thanks @scottalanmiller. I think I’ll start with DRBD over two single PCIe NVMe cards (one in each node) synced through an InfiniBand link (InfiniBand is cheap today!), and I will slowly move every capable workload to application clustering.
Just remember, application clustering at that level is normally a large cost, whereas RAID is a low cost. And application clustering is slow, while RAID is fast.
But DRBD cannot be part of application clustering. DRBD is a RAID system. So you can, of course, use DRBD and application clustering, but they are two very different things.
Local RAID is about speed and low cost. DRBD is Network RAID, so high cost and introduces latency. Application clustering doesn't need RAID of any sort, as it is already clustered. You would use totally independent storage for application clustering.
-
But consider cost and risk of traditional RAID 1 vs. mirrored application clustering for a workload like MariaDB (just as a sample.)
Base Server: $10K
Application Clustering: You need two servers, so your cost is $20,000. And that's assuming application clustering is available for the workload, and free. It is free with MariaDB, so this is a good use case.
Traditional RAID: You need an extra SSD for your one server, so say add $500 onto your base cost for a total of $10,500. That's a fraction of the cost of the application clustering.
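The same comparison as a few lines of arithmetic, using only the dollar figures from this post (the variable and function names are invented for the sketch):

```python
BASE_SERVER = 10_000  # base server cost from the post, in dollars
EXTRA_SSD = 500       # one additional SSD for RAID 1, from the post


def total_costs() -> dict:
    """Total hardware cost of each approach, assuming the clustering software is free."""
    return {
        "application_clustering": 2 * BASE_SERVER,  # two full servers
        "traditional_raid": BASE_SERVER + EXTRA_SSD,  # one server plus one SSD
    }


costs = total_costs()
extra_cluster = costs["application_clustering"] - BASE_SERVER
extra_raid = costs["traditional_raid"] - BASE_SERVER

print(costs)
print(f"extra spend for clustering: ${extra_cluster:,}")
print(f"extra spend for RAID 1:     ${extra_raid:,}")
print(f"clustering premium is {extra_cluster / extra_raid:.0f}x the RAID premium")
```

Looking at the extra spend over a single unprotected server, rather than the totals, makes the gap clearer: $10,000 versus $500.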
-
Performance:
Application Clustering: Because data has to be synced over the network, there is a performance hit from application clustering. For enough money, you can minimize this greatly, but it just costs more and more to do so.
Traditional RAID: RAID 1 is faster than no-RAID. And moving to things like RAID 10 can speed you up even more. So rather than taking a performance hit, RAID for protection of this nature will speed you up.
-
Reliability:
Application Clustering requires everything be duplicated, even CPUs and RAM, so there are some reliability benefits from the high cost of redundancy. But these are typically minor, as the extra redundancy is mostly around pieces that rarely fail. It's a brute force redundancy, rather than a finesse redundancy.
RAID targets the pieces of the system that are most fragile and critical - the storage. It is the drives failing alone that causes full data loss, and drives represent the majority of hardware failures. So you get 99% of the protection, at a fraction of the price.
Because RAID is so mature and reliable, there is an argument that this, combined with its insane speed, cache protection options, and such, will actually make it safer than comparable application layer protections.
-
Effort:
Application Clustering requires a lot of expertise, and expertise unique to each and every workload, which must then be monitored, maintained, and updated to keep working. This often triples or quadruples the effort to build and maintain a workload, and in extreme cases can be far worse. This is an ongoing effort requiring expertise around maintaining clustering and dealing with edge situations, software changes, and so forth. This is generally outside of the skill set of many IT shops, depending on the workloads. Some clustering, like Windows AD, is really simple; some, like many databases, is very hard.
RAID: Zero effort. Tell it to turn on, ignore it. There is nothing to know or do and the system can be safely turned over even to non-technical staff to maintain.
-
Cost of Licensing:
Application Clustering: This is often a costly add-on to many software products (and is not always available), and often requires extra software purchases. For example, with MS SQL Server it generally requires more Windows Server and SQL Server licenses, plus additional cost for the application clustering layer. So for many workloads, and any on Windows, the licensing cost soars rapidly.
RAID: No known products have any licensing costs tied to block storage redundancy. So there is no cost of this in the real world.
-
@scottalanmiller I see your points, but let me just add some additional information about the current configuration:
- we already have three identical servers and one NVMe PCIe card;
- I want to use DRBD replication only for stuff that cannot be made highly available without upgrades, like SQL Server Standard etc. I'm thinking of using Syncthing for file replication.
-
@scottalanmiller said in Application clustering VS RAID with modern SSD:
But consider cost and risk of traditional RAID 1 vs. mirrored application clustering for a workload like MariaDB (just as a sample.)
Base Server: $10K
Application Clustering: You need two servers, so your cost is $20,000. And that's assuming application clustering is available for the workload, and free. It is free with MariaDB, so this is a good use case.
Traditional RAID: You need an extra SSD for your one server, so say add $500 onto your base cost for a total of $10,500. That's a fraction of the cost of the application clustering.
I already have the servers; they are the same spec and out of vendor support.
-
@scottalanmiller said in Application clustering VS RAID with modern SSD:
Performance:
Application Clustering: Because data has to be synced over the network, there is a performance hit from application clustering. For enough money, you can minimize this greatly, but it just costs more and more to do so.
Traditional RAID: RAID 1 is faster than no-RAID. And moving to things like RAID 10 can speed you up even more. So rather than taking a performance hit, RAID for protection of this nature will speed you up.
Async replication has almost NO performance hit on the master.
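A toy model of the point being argued here, not a benchmark. The timings are hypothetical placeholders, and the model assumes a synchronous commit has to wait for the remote acknowledgement while an asynchronous one only waits for the local write.

```python
from dataclasses import dataclass


@dataclass
class Timings:
    local_write_us: float   # time to persist a write on the local device
    network_rtt_us: float   # round trip to the replica
    remote_write_us: float  # time for the replica to persist the write


def commit_latency_us(t: Timings, mode: str) -> float:
    """Latency seen by the application for one committed write."""
    if mode in ("local", "async"):
        # Async replication ships the change after the commit returns,
        # so the master only waits for its own storage.
        return t.local_write_us
    if mode == "sync":
        # Local and remote writes proceed in parallel; the commit waits for both.
        return max(t.local_write_us, t.network_rtt_us + t.remote_write_us)
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical numbers: fast NVMe on both sides, a low-latency link between them.
example = Timings(local_write_us=20, network_rtt_us=100, remote_write_us=20)
for mode in ("local", "async", "sync"):
    print(f"{mode:5s}: {commit_latency_us(example, mode):.0f} us per commit")
```

The trade-off the model does not show is that with async replication the replica can lag, so writes committed just before a failure may be missing after failover.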
-
@scottalanmiller said in Application clustering VS RAID with modern SSD:
Reliability:
Application Clustering requires everything be duplicated, even CPUs and RAM, so there are some reliability benefits from the high cost of redundancy. But these are typically minor, as the extra redundancy is mostly around pieces that rarely fail. It's a brute force redundancy, rather than a finesse redundancy.
RAID targets the pieces of the system that are most fragile and critical - the storage. It is the drives failing alone that causes full data loss, and drives represent the majority of hardware failures. So you get 99% of the protection, at a fraction of the price.
Because RAID is so mature and reliable, there is an argument that this, combined with its insane speed, cache protection options, and such, will actually make it safer than comparable application layer protections.
This is unfair, you are really comparing apples to oranges: in one case you have a completely shared-nothing cluster, in the other you are only protected from storage disk failure. What if the CPU/mobo/controller/PSU/etc. fails?
-
@scottalanmiller said in Application clustering VS RAID with modern SSD:
Effort:
Application Clustering requires a lot of expertise, and expertise unique to each and every workload, which must then be monitored, maintained, and updated to keep working. This often triples or quadruples the effort to build and maintain a workload, and in extreme cases can be far worse. This is an ongoing effort requiring expertise around maintaining clustering and dealing with edge situations, software changes, and so forth. This is generally outside of the skill set of many IT shops, depending on the workloads. Some clustering, like Windows AD, is really simple; some, like many databases, is very hard.
RAID: Zero effort. Tell it to turn on, ignore it. There is nothing to know or do and the system can be safely turned over even to non-technical staff to maintain.
Mostly true, but very different stuff.
-
@scottalanmiller said in Application clustering VS RAID with modern SSD:
Cost of Licensing:
Application Clustering: This is often a costly add-on to many software products (and is not always available), and often requires extra software purchases. For example, with MS SQL Server it generally requires more Windows Server and SQL Server licenses, plus additional cost for the application clustering layer. So for many workloads, and any on Windows, the licensing cost soars rapidly.
RAID: No known products have any licensing costs tied to block storage redundancy. So there is no cost of this in the real world.
This is true, and I’m trying to avoid that cost via DRBD replication.
-
@francesco-provino said in Application clustering VS RAID with modern SSD:
- I want to use DRBD replication only for stuff that cannot be made highly available without upgrades, like SQL Server Standard etc. I'm thinking of using Syncthing for file replication.
That can work, but things like rsync are often better for that. Less latency.
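A minimal sketch of what a one-way rsync push could look like when wrapped in Python for scheduling. The paths and the `node2` hostname are hypothetical, and it defaults to a dry run so nothing is changed until the flag is turned off. Only standard rsync options are used: `-a` (archive), `--delete` (remove files on the target that no longer exist on the source), and `--dry-run`.

```python
import shlex
import subprocess
from typing import List


def build_rsync_cmd(src: str, dest: str, dry_run: bool = True) -> List[str]:
    """Build the rsync argument list for a one-way mirror of src to dest."""
    cmd = ["rsync", "-a", "--delete"]
    if dry_run:
        cmd.append("--dry-run")  # report what would change without touching the target
    cmd += [src, dest]
    return cmd


def run_sync(src: str, dest: str, dry_run: bool = True) -> int:
    """Run rsync and return its exit code (0 means success)."""
    return subprocess.run(build_rsync_cmd(src, dest, dry_run), check=False).returncode


# Hypothetical source directory and replica host; only prints the command.
print(shlex.join(build_rsync_cmd("/srv/files/", "node2:/srv/files/")))
```

Note that `--delete` makes the replica an exact mirror, so an accidental deletion on the primary is propagated on the next run; this is replication, not backup.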