Application clustering VS RAID with modern SSD

scottalanmiller

@francesco-provino said in Application clustering VS RAID with modern SSD:

@scottalanmiller said in Application clustering VS RAID with modern SSD:

But consider cost and risk of traditional RAID 1 vs. mirrored application clustering for a workload like MariaDB (just as a sample.)

Base Server: $10K

Application Clustering: You need two servers, so your cost is $20,000. And that's assuming application clustering is available for the workload, and free. It is free with MariaDB, so this is a good use case.

Traditional RAID: You need an extra SSD for your one server, so say add $500 onto your base cost for a total of $10,500. That's a fraction of the cost of the application clustering.

I already have servers, they are the same spec and out of vendor support.

That's very different. If the goal is to use "whatever hardware is lying around" rather than designing for a specific use case, then anything that fits the needs of the hardware might make sense.
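As a rough sketch of the arithmetic in the quoted design-from-scratch comparison, the snippet below uses the $10K server and $500 SSD figures from that post. The function names and the `license_per_node` parameter are made up for illustration; licensing is set to zero to match the MariaDB case.

```python
# Figures from the quoted post: a $10,000 base server and a $500 extra SSD.
BASE_SERVER_USD = 10_000
EXTRA_SSD_USD = 500


def clustering_cost(base=BASE_SERVER_USD, nodes=2, license_per_node=0):
    """Hardware plus (hypothetical) per-node licensing for an application cluster."""
    return nodes * (base + license_per_node)


def raid1_cost(base=BASE_SERVER_USD, extra_ssd=EXTRA_SSD_USD):
    """One server plus one extra SSD for a local RAID 1 mirror."""
    return base + extra_ssd


cluster = clustering_cost()
raid = raid1_cost()
print(cluster, raid, cluster - raid)  # 20000 10500 9500
```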

scottalanmiller

@francesco-provino said in Application clustering VS RAID with modern SSD:

@scottalanmiller said in Application clustering VS RAID with modern SSD:

Performance:

Application Clustering: Because data has to be synced over the network, there is a performance hit from application clustering. For enough money, you can minimize this greatly, but it just costs more and more to do so.

Traditional RAID: RAID 1 is faster than no-RAID. And moving to things like RAID 10 can speed you up even more. So rather than taking a performance hit, RAID for protection of this nature will speed you up.

Async replication has almost NO performance hit on the master.

HA application clustering is always sync, though, not async. The application needs to wait for confirmation from its peers before unlocking, or else it is not competing with RAID for data protection.
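To illustrate the sync versus async distinction, here is a toy model, not how MariaDB or any real cluster is implemented. The `Peer` class, the function names, and the latency numbers (in microseconds) are all hypothetical. It shows that a sync commit pays for the peer round trip, while an async commit returns at local speed but leaves the peer without the record until the backlog is shipped.

```python
from dataclasses import dataclass, field


@dataclass
class Peer:
    """A replica node; round_trip_us is a hypothetical network round trip."""
    round_trip_us: int
    log: list = field(default_factory=list)


def commit_sync(record, local_us, peers):
    """Acknowledge the write only after every peer has confirmed it."""
    for peer in peers:
        peer.log.append(record)
    # Peers are contacted in parallel, so the wait is the slowest round trip.
    return local_us + max(peer.round_trip_us for peer in peers)


def commit_async(record, local_us, backlog):
    """Acknowledge immediately; the record is shipped to peers later."""
    backlog.append(record)
    return local_us


peer = Peer(round_trip_us=500)
backlog = []
sync_latency = commit_sync("row-1", local_us=100, peers=[peer])
async_latency = commit_async("row-2", local_us=100, backlog=backlog)

print(sync_latency, async_latency)  # 600 100
print("row-2" in peer.log)          # False: lost if the primary dies now
```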

scottalanmiller

@francesco-provino said in Application clustering VS RAID with modern SSD:

@scottalanmiller said in Application clustering VS RAID with modern SSD:

Reliability:

Application Clustering requires everything be duplicated, even CPUs and RAM, so there are some benefits to reliability improvements from the high cost of redundancy. But typically these are minor, as the extra redundancy is typically around pieces that rarely fail. It's a brute force redundancy, rather than a finesse redundancy.

RAID targets the pieces of the system that are most fragile and critical - the storage. It is the drives failing alone that causes full data loss, and drives represent the majority of hardware failures. So you get 99% of the protection, at a fraction of the price.

Because RAID is so mature and reliable, there is an argument that that combined with its insane speed, cache protection options and such will actually be safer than application layer protections that are comparable.

This is unfair, you are really comparing apples to oranges: in one case you have a completely shared-nothing cluster, in the other you are just protected from storage disk failure. What if the CPU/mobo/controller/PSU/etc. fails?

It may be apples and oranges, but that's where it starts. It's two very different things and under normal circumstances you never consider application replication unless you have RAID. RAID is cheap and really effective.

Although it might seem like apples and oranges, it's like turbo charging or getting a bigger engine - very different techniques, same goal. Here it is two reliability techniques, one goal. The point is, of the two, 90% of the time RAID is actually more effective. It might SEEM like having all those other parts with extra redundancy would do a lot for you, but in the real world it doesn't do all that much. And the risks you take on by avoiding the RAID will rarely be offset by all that extra redundancy.

Think of it like an airplane... the thing you want redundant is the engine. Sure, extra seats, steering wheels, wings, wheels, etc. all sound great, and if they are free then sure, but 95% of the time it is the engine that fails, not any of those things. So you can't get distracted by the "what if X happens"; you have to remain focused on the resultant reliability. I think that you will find that RAID is either around break even with, or even safer than, a non-RAID general redundancy approach at the same redundancy level (single mirror, double mirror, etc.)

scottalanmiller

@francesco-provino said in Application clustering VS RAID with modern SSD:

@scottalanmiller said in Application clustering VS RAID with modern SSD:

Effort:

Application Clustering requires a lot of expertise, and unique expertise for each and every workload, which must then be monitored, maintained, and updated to keep working. This often triples or quadruples the effort to build and maintain a workload and in extreme cases can be far worse. This is an ongoing effort requiring expertise around maintaining clustering and dealing with edge situations, software changes, and so forth. This is generally outside of the skill set of many IT shops, depending on the workloads. Some clustering, like Windows AD, is really simple; some, like many databases, is very hard.

RAID: Zero effort. Tell it to turn on, ignore it. There is nothing to know or do and the system can be safely turned over even to non-technical staff to maintain.

Mostly true, but very different stuff.

That it is different really doesn't matter. Remember, the goal is resulting reliability. So while it is different, we don't care, we care that the better reliability or equal reliability is cheaper, easier, and faster.

scottalanmiller

@francesco-provino said in Application clustering VS RAID with modern SSD:

@scottalanmiller said in Application clustering VS RAID with modern SSD:

Cost of Licensing:

Application Clustering: This is often a costly add-on to many software products (and is not always available), and often requires extra software purchases. For example, with MS SQL Server it generally requires more Windows Server and SQL Server licenses, plus additional cost for the application clustering layer. So for many workloads, and any on Windows, the licensing cost soars rapidly.

RAID: No known products have any licensing costs tied to block storage redundancy. So there is no cost of this in the real world.

This is true, and I’m trying to avoid that cost via DRBD replication.

DRBD replication can only avoid licensing costs (in most cases) if you don't have the workloads running in the second location. You can certainly do that, and it's valid, but it's really not how it is designed to be used. And it results in something drastically different, as you pointed out about the app vs RAID earlier. In your DRBD non-licensed model, if one drive fails, your application fails, giving you dramatically less protection than regular RAID.

So again, if this is about reusing what you have, then cost savings might trump everything else. If it is about planning for alternative high availability approaches, I think the "don't consider anything until you have RAID locally" mantra remains valid for all but the rarest cases.

Francesco Provino

Again, valid point. The alternative is to put another PCIe SSD in the first node and RAID it (mdadm). And, of course, buy another TWO of them and put them in the other node, in case the first one fails. This is gonna be much higher in cost…

scottalanmiller

@francesco-provino said in Application clustering VS RAID with modern SSD:

Again, valid point. The alternative is to put another PCIe SSD in the first node and RAID it (mdadm). And, of course, buy another TWO of them and put them in the other node, in case the first one fails. This is gonna be much higher in cost…

That's misleading because it's not the real alternative. If you were okay with no RAID, but two nodes, you are okay with one node and RAID. Your leap to needing a second node doesn't make sense, it's a level of reliability you don't require. So that's an apple to the orange.

What is comparable to two nodes without RAID, and easily safer, is a second SSD in the single node and NO second node. Even if it isn't safer, it's REALLY close.

So you can't use the "need a second node with RAID" scenario as a comparison for anything, it's outside of the scope and not roughly comparable. So ignore it, it's not relevant.

Your options are... one node with RAID, or two nodes with network RAID. Single node with regular RAID is faster, simpler, cheaper, and easily comparable, if not better, for reliability. A second node with no RAID is just vastly impractical unless it is somehow free while having RAID is not.

Francesco Provino

@scottalanmiller said in Application clustering VS RAID with modern SSD:

@francesco-provino said in Application clustering VS RAID with modern SSD:

Again, valid point. The alternative is to put another PCIe SSD in the first node and RAID it (mdadm). And, of course, buy another TWO of them and put them in the other node, in case the first one fails. This is gonna be much higher in cost…

That's misleading because it's not the real alternative. If you were okay with no RAID, but two nodes, you are okay with one node and RAID. Your leap to needing a second node doesn't make sense, it's a level of reliability you don't require. So that's an apple to the orange.

What is comparable to two nodes without RAID, and easily safer, is a second SSD in the single node and NO second node. Even if it isn't safer, it's REALLY close.

So you can't use the "need a second node with RAID" scenario as a comparison for anything, it's outside of the scope and not roughly comparable. So ignore it, it's not relevant.

Your options are... one node with RAID, or two nodes with network RAID. Single node with regular RAID is faster, simpler, cheaper, and easily comparable, if not better, for reliability. A second node with no RAID is just vastly impractical unless it is somehow free while having RAID is not.

Uhm, I’m sorry but I don’t agree with you.
The network RAID will have the same cost (always one other SSD) and BETTER reliability, because even if the mainboard/CPU/etc. fails in the first node, I will have another ready-to-go host with everything I need to start my environment.

There is also the possibility of creating TWO DRBD replica sets, one active on the first node and the other active on the second; that way, I can easily double the total CPU count and RAM available for the VMs… sort of hyperconvergence on the cheap!

scottalanmiller

@francesco-provino said in Application clustering VS RAID with modern SSD:

The network RAID will have the same cost (always one other SSD) and BETTER reliability, because even if the mainboard/CPU/etc. fails in the first node, I will have another ready-to-go host with everything I need to start my environment.

I explained earlier why that isn't how reliability is measured, though. Stating a "what if" doesn't make it more reliable. Motherboards and CPUs rarely fail, so just because you protect against that, at the cost of making the storage far less reliable, doesn't make the system more reliable. It makes failover and recovery from the majority of failures far worse and more risky.

This approach is protecting against the unlikely because it "sounds bad" rather than protecting better against the most likely because it's boring.

scottalanmiller

@francesco-provino said in Application clustering VS RAID with modern SSD:

There is also the possibility of creating TWO DRBD replica sets, one active on the first node and the other active on the second; that way, I can easily double the total CPU count and RAM available for the VMs… sort of hyperconvergence on the cheap!

It's hyperconverged whether you do that or not. HC is free, even with far more robust systems like Starwind. HC doesn't imply that you have HA or can move workloads around. Most people do that, but it's HC from the moment you go with the design here. But DRBD isn't saving you anything over normal baseline. So while this is cheap, it's not special or cheaper, and it's a well known model that under normal circumstances you would never do without local RAID because it's been analyzed heavily for decades and it just doesn't provide a logical protection versus simpler, cheaper approaches.