How Does Local Storage Offer High Availability

scottalanmiller

Just to keep this going, @dafyre please tell us what the old failing system looked like. Was it 10 servers each with internal disks? What was failing?

And it doesn't mean that the old system was "bad", it could have just been normal.

Two HP ProLiant DL380 servers in a cluster (if the clustering is good) is way more reliable than a single ProLiant DL380.

But are two of them as reliable as a single HP Integrity Superdome? Not likely. Those things never go down. Never. It's unheard of.

Now which is more cost effective? Buying 100 ProLiants instead of one Superdome, of course. Which is more powerful? One Superdome.
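To put rough numbers on the cluster-versus-single-box comparison, here is a minimal sketch of the availability arithmetic. Everything in it is an assumption for illustration: the function names, the toy two-node model, and especially the availability figures, which are made up and not measurements of any HP system.

```python
HOURS_PER_YEAR = 24 * 365


def two_node_availability(node: float, failover_success: float) -> float:
    """Toy model of a two-node cluster.

    The service is up if the primary is up, or if the primary is down,
    the failover works, and the secondary is up. Assumes independent
    node failures, which real clusters only approximate.
    """
    return node + (1.0 - node) * failover_success * node


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    commodity = 0.999    # hypothetical single commodity server
    high_end = 0.99999   # hypothetical single high-RAS system
    for label, value in [
        ("single commodity node", commodity),
        ("two nodes, failover always works", two_node_availability(commodity, 1.0)),
        ("two nodes, failover works 90% of the time", two_node_availability(commodity, 0.9)),
        ("single high-end system", high_end),
    ]:
        print(f"{label}: {downtime_hours_per_year(value):.3f} h/year")
```

With these made-up inputs, the pair only beats the single high-end box when failover almost always works, which is the "if the clustering is good" caveat in numeric form.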

scottalanmiller

@wirestyle22 said:

Even the fact that this is possible is amazing to me

Ever see an HP Integrity withstand an artillery round? There is a video of an HP Integrity doing that (easily ten years old) and another one of an HP 3PAR SAN taking one (more recent, actually the video was made by @HPEStorageGuy who is here in the community.) The HP 3PAR is basically HP's "mini computer" class of storage (same class as the HP Integrity is in servers).

In both cases, they fired an artillery round into the chassis of a running HP system (bolted to a surface of course as the thing would have gone flying) and in both cases the system stayed up and running, didn't lose a ping.

wirestyle22

@scottalanmiller said:

@Dashrender said:

Just to keep this going, @dafyre please tell us what the old failing system looked like. Was it 10 servers each with internal disks? What was failing?

And it doesn't mean that the old system was "bad", it could have just been normal.

Two HP ProLiant DL380 servers in a cluster (if the clustering is good) is way more reliable than a single ProLiant DL380.

But are two of them as reliable as a single HP Integrity Superdome? Not likely. Those things never go down. Never. It's unheard of.

Now which is more cost effective? Buying 100 ProLiants instead of one Superdome, of course. Which is more powerful? One Superdome.

Can you clarify as to what you mean? What reason do they attribute to a higher uptime than a ProLiant if they are both configured correctly? Honest question.

wirestyle22

@scottalanmiller said:

@wirestyle22 said:

Even the fact that this is possible is amazing to me

Ever see an HP Integrity withstand an artillery round? There is a video of an HP Integrity doing that (easily ten years old) and another one of an HP 3PAR SAN taking one (more recent, actually the video was made by @HPEStorageGuy who is here in the community.) The HP 3PAR is basically HP's "mini computer" class of storage (same class as the HP Integrity is in servers).

In both cases, they fired an artillery round into the chassis of a running HP system (bolted to a surface of course as the thing would have gone flying) and in both cases the system stayed up and running, didn't lose a ping.

That's wild. HP is doin' it right now.

Dashrender

@scottalanmiller said:

@Dashrender said:

That may be so, but who would care, because in your RAID 0 if you lose any drives, all of your data is gone, so being redundant is pointless in that case - the only thing you care about with RAID 0 is the array for performance, not reliability.

You are stuck on the idea that your array always carries stateful data. That's an incorrect assumption. RAID 0 arrays can be perfectly functional when degraded if they are not used for stateful data. So the redundancy remains fully useful.

really? the array will stay active in a degraded state? I had no idea - I figured the RAID controller would basically just kill the array once a drive was lost. yep me and assuming = mistake...

scottalanmiller

@wirestyle22 said:

Can you clarify as to what you mean? What reason do they attribute to a higher uptime than a ProLiant if they are both configured correctly? Honest question.

So the HPE ProLiant line is a micro-computer line based on the PC architecture. They are, just to clarify, the industry reference standard for commodity servers (generally considered the best in the business going back to the Compaq ProLiant era in the mid-1990s.) They are very good, but they are "commodity". They are basically no different (more or less) than any PC you could build yourself with parts you order online (this is not totally true, there is a tonne of HPE unique engineering, they are tested like crazy, they have custom firmware and boards, they buy parts better than are on the open market, they add some proprietary stuff like the iLO, etc.) but more or less, these are PCs. The DL380 is the best selling server in the world, from any vendor, in any category.

The HPE Integrity line is a mini-computer line. They are not PCs. Most of them (not all) are built on the IA64 EPIC architecture and have RAS [Reliability, availability and serviceability] features that the PC architecture does not support. For example, hot swappable memory and CPUs are standard. Things like redundant controllers are common. The overall build and design is less about cost savings and more about never failing (or being fixed without going down.) It's a truly different class of device. They are also bigger devices, you don't put one in just to run your website. But you can fit more workloads on them, making it make more sense to invest in a single device that almost never fails.

scottalanmiller

@wirestyle22 said:

In both cases, they fired an artillery round into the chassis of a running HP system (bolted to a surface of course as the thing would have gone flying) and in both cases the system stayed up and running, didn't lose a ping.

That's wild. HP is doin' it right now.

HP has been doing this stuff for decades. This isn't new technology. You can get similar from IBM, Oracle and Fujitsu. Dell does not dabble in the mini and mainframe market.

From IBM this would be the i and z series (i is the mini, z is the mainframe). From Oracle this is the M series. Fujitsu makes the M series for Oracle (they co-design it and Fujitsu makes it) and also sells it under its own branding, which I don't know because it is not sold in America; here you just buy the Oracle branded ones.

wirestyle22

@scottalanmiller said:

@wirestyle22 said:

Can you clarify as to what you mean? What reason do they attribute to a higher uptime than a ProLiant if they are both configured correctly? Honest question.

So the HPE ProLiant line is a micro-computer line based on the PC architecture. They are, just to clarify, the industry reference standard for commodity servers (generally considered the best in the business going back to the Compaq ProLiant era in the mid-1990s.) They are very good, but they are "commodity". They are basically no different (more or less) than any PC you could build yourself with parts you order online (this is not totally true, there is a tonne of HPE unique engineering, they are tested like crazy, they have custom firmware and boards, they buy parts better than are on the open market, they add some proprietary stuff like the iLO, etc.) but more or less, these are PCs. The DL380 is the best selling server in the world, from any vendor, in any category.

The HPE Integrity line is a mini-computer line. They are not PCs. Most of them (not all) are built on the IA64 EPIC architecture and have RAS [Reliability, availability and serviceability] features that the PC architecture does not support. For example, hot swappable memory and CPUs are standard. Things like redundant controllers are common. The overall build and design is less about cost savings and more about never failing (or being fixed without going down.) It's a truly different class of device. They are also bigger devices, you don't put one in just to run your website. But you can fit more workloads on them, making it make more sense to invest in a single device that almost never fails.

Interesting. Thank you for the information.

scottalanmiller

@Dashrender said:

really? the array will stay active in a degraded state? I had no idea - I figured the RAID controller would basically just kill the array once a drive was lost. yep me and assuming = mistake...

Oh there will be a blip, the array has to re-initialize. It's not a transparent fail like a RAID 1 would be. But it can be automatic and very, very fast. Few people do this, but you can, no problem. You could probably get downtime to a couple of seconds.

Dashrender

@scottalanmiller said:

@Dashrender said:

really? the array will stay active in a degraded state? I had no idea - I figured the RAID controller would basically just kill the array once a drive was lost. yep me and assuming = mistake...

Oh there will be a blip, the array has to re-initialize. It's not a transparent fail like a RAID 1 would be. But it can be automatic and very, very fast. Few people do this, but you can, no problem. You could probably get downtime to a couple of seconds.

Interesting, I had no idea - what would it be good for?

scottalanmiller

@Dashrender said:

Interesting, I had no idea - what would it be good for?

Primarily caching or cache-like databases.

If you were doing a large proxy cache, for example, this would be a great way to handle it. Don't invest too much money where it isn't needed, and in the case of a drive loss, the delay as you flush the cache and reload isn't too tragic.

The need for this is rapidly going away because of SSDs. So I'm not about to run out and do this today, mind you. But five years ago or, far more likely, twenty years ago, this would have been absolutely viable and completely obvious if you had the use case for it. Today, meh, who needs RAID 0 for speed?

Or any kind of read only caching system, even on a database.

Or a short term use database where the data isn't useful after, say, a day. On Wall St. we had a lot of systems that took trillions of transactions per day and then, at the end of the day.... dropped them. With a RAID 0, you might just accept losing a few hours of data, recreate and be underway again because the streaming writes are more important than any potential reliability.
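The "flush the cache and reload" pattern can be sketched in a few lines. This is only an illustration under stated assumptions: `DisposableCache`, `StorageFailure`, and the `broken` flag are hypothetical names, the in-memory dict stands in for the RAID 0 volume, and a real setup would recreate the array with the OS or controller tooling rather than in application code.

```python
from typing import Callable, Dict, Optional


class StorageFailure(Exception):
    """Stands in for the I/O error seen after a RAID 0 member drive dies."""


class DisposableCache:
    """Cache whose backing store may be thrown away and rebuilt at any time."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes]) -> None:
        self._fetch = fetch_from_origin
        self._store: Dict[str, bytes] = {}  # stand-in for the RAID 0 volume
        self.broken = False                 # set True to simulate a dead drive
        self.rebuilds = 0

    def _read(self, key: str) -> Optional[bytes]:
        if self.broken:
            raise StorageFailure("array lost a member drive")
        return self._store.get(key)

    def _reinitialize(self) -> None:
        # Analogous to recreating the array: everything cached is gone,
        # which is acceptable because the origin still has the real data.
        self._store = {}
        self.broken = False
        self.rebuilds += 1

    def get(self, key: str) -> bytes:
        try:
            value = self._read(key)
        except StorageFailure:
            self._reinitialize()
            value = None
        if value is None:
            # Cache miss (or fresh array): reload from the origin.
            value = self._fetch(key)
            self._store[key] = value
        return value
```

The point of the design is that nothing stateful lives only in the cache, so losing the whole array costs a short rebuild and some reload traffic, not data.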

dafyre

In a conversation with a few others, someone brought up the point that all anyone cares about is being able to access the service.

And I agree with this. If the users are able to access the services consistently (reliably, perhaps?) and without data loss, that is the ultimate end goal.

scottalanmiller

@dafyre said:

In a conversation with a few others, someone brought up the point that all anyone cares about is being able to access the service.

And I agree with this. If the users are able to access the services consistently (reliably, perhaps?) and without data loss, that is the ultimate end goal.

That's the cat. More than one way to skin it.

Although it should be noted that the ultimate end goal is doing that cost effectively. In IT, cost is always a factor. So even if we can improve uptime, it has to have an ROI that makes it make sense as well. Otherwise we'd all be running redundant mainframes for everything.

scottalanmiller

@wirestyle22 in a similar vein to the RAS features, some systems like the IBM z series also have a computational verification that is unheard of in lesser platforms. The z series offers the ability to do everything (literally, every clock cycle) twice using two different processors. That way if a CPU fails, memory fails or there is a gamma radiation hit or sun spot or whatever and a bit flips, the system will catch the discrepancy and run the operation again.
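As a rough software analogy for that "do it twice, compare, re-run on a mismatch" idea, here is a minimal sketch. It is not how the hardware implements it (that happens in silicon, not in application code); `run_verified`, `PersistentMismatch`, and the retry limit are all hypothetical names and choices made for illustration.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class PersistentMismatch(Exception):
    """Raised when the two execution units keep disagreeing."""


def run_verified(unit_a: Callable[[], T], unit_b: Callable[[], T],
                 max_attempts: int = 3) -> T:
    """Run the same operation on two independent units and compare results.

    If the results differ (for example, because a bit flipped in one
    unit), both runs are discarded and the operation is executed again.
    """
    for _ in range(max_attempts):
        first = unit_a()
        second = unit_b()
        if first == second:
            return first
    raise PersistentMismatch(f"results still differ after {max_attempts} attempts")


def make_flaky_adder() -> Callable[[], int]:
    """Build a unit that returns a bit-flipped result on its first call only."""
    state = {"calls": 0}

    def unit() -> int:
        state["calls"] += 1
        result = 2 + 2
        if state["calls"] == 1:
            result ^= 1  # simulate a single flipped bit
        return result

    return unit


if __name__ == "__main__":
    print(run_verified(lambda: 2 + 2, make_flaky_adder()))
```

A transient fault is caught and retried; a fault that never clears surfaces as an error instead of a silently wrong answer.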

scottalanmiller

I forgot about this topic and found it mentioned in a conversation. This thread was a great resource that never got linked anywhere useful. Now to figure out how to make it more referenceable.