Large or small Raid 5 with SSD

travisdh1

@Dashrender said in Large or small Raid 5 with SSD:

@scottalanmiller said in Large or small Raid 5 with SSD:

@Donahue said in Large or small Raid 5 with SSD:

just to clarify, we are talking about two different risks, with two different triggers, correct? The risk of a second disk failure while degraded, which is triggered the moment the first disk dies. The second risk (and less so for SSD) is URE, but my question is does this risk only trigger once you initiate a rebuild? Because it is the rebuild itself that is trying to read the unreadable block during its parity calculation?

The URE risk only triggers once you trigger a rebuild, but the shift risk happens the moment you delay replacing the disk. You can't win through that thought process.

How is the URE not a risk the instant the first drive fails?

It is a risk, but because we're talking about SSDs specifically, the chance of a URE is far smaller than with an HDD. The flash will normally fail before a URE occurs. It's not impossible; the chance of it actually happening is just much smaller than the chance of other failures.

Can't a URE happen during normal disk operation? I.e., you're in degraded status and, while reading before starting the rebuild, you hit a URE?

Normal operation of the RAID would correct the issue. Degraded behavior depends on the RAID level, e.g. RAID 6 in degraded mode should function like a RAID 5, so a URE doesn't become a problem until the second drive fails.

Again, a URE is not an expected failure mode for SSDs. Not that it can't happen; it's just very unlikely.

scottalanmiller

@Dashrender said in Large or small Raid 5 with SSD:

@scottalanmiller said in Large or small Raid 5 with SSD:

@Donahue said in Large or small Raid 5 with SSD:

just to clarify, we are talking about two different risks, with two different triggers, correct? The risk of a second disk failure while degraded, which is triggered the moment the first disk dies. The second risk (and less so for SSD) is URE, but my question is does this risk only trigger once you initiate a rebuild? Because it is the rebuild itself that is trying to read the unreadable block during its parity calculation?

The URE risk only triggers once you trigger a rebuild, but the shift risk happens the moment you delay replacing the disk. You can't win through that thought process.

How is the URE not a risk the instant the first drive fails? Can't a URE happen during normal disk operation? I.e., you're in degraded status and, while reading before starting the rebuild, you hit a URE?

It's not an array risk outside of a rebuild. It's rebuilding that causes the cascade of the URE to make the whole array unreadable.

That a URE can happen is not the fear. That a URE can happen during a rebuild is the fear, because the array is effectively a "single file" being rewritten during a rebuild, and if the write operation cannot complete, everything is lost.

scottalanmiller

@travisdh1 said in Large or small Raid 5 with SSD:

Normal operation of the RAID would correct the issue. Degraded behavior depends on the RAID level, e.g. RAID 6 in degraded mode should function like a RAID 5, so a URE doesn't become a problem until the second drive fails.

To be clear, a URE during normal degraded operations does impact one file, but not the array. From the point of view of the array, nothing is wrong. During a rebuild, that same URE takes out the entire array in a parity RAID system. So very different results from the same URE.
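As a rough illustration of why the same URE behaves so differently, here is a minimal Python sketch of XOR parity on a single stripe. The block contents, the 4-drive layout, and the helper names `xor_blocks` and `rebuild_missing` are made up for illustration. A real controller works on far larger stripes, and how it reacts to an unrecoverable stripe (abort the whole rebuild or carry on) is implementation-specific and not modelled here.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID 5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving):
    """Rebuild the one missing block of a stripe from all surviving blocks.

    `surviving` holds the remaining data and parity blocks of the stripe.
    None marks a block that could not be read (a URE).
    """
    if any(block is None for block in surviving):
        raise IOError("unreadable block in stripe: nothing left to rebuild from")
    return xor_blocks(surviving)


# One stripe on a hypothetical 4-drive RAID 5: three data blocks plus parity.
d0, d1, d2 = b"\x01\x02", b"\x10\x20", b"\xaa\xbb"
parity = xor_blocks([d0, d1, d2])

# The drive holding d1 has failed and every other block reads fine:
# the missing block comes back from the survivors.
assert rebuild_missing([d0, d2, parity]) == d1

# Same stripe, but d2 hits a URE while the rebuild is reading it.
try:
    rebuild_missing([d0, None, parity])
except IOError as err:
    print(err)
```

The point of the sketch is only that a rebuild has to read every surviving block of every stripe, whereas a normal degraded read touches just the stripes behind the file being read.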

1337

@scottalanmiller said in Large or small Raid 5 with SSD:

@travisdh1 said in Large or small Raid 5 with SSD:

Normal operation of the RAID would correct the issue. Degraded behavior depends on the RAID level, e.g. RAID 6 in degraded mode should function like a RAID 5, so a URE doesn't become a problem until the second drive fails.

To be clear, a URE during normal degraded operations does impact one file, but not the array. From the point of view of the array, nothing is wrong. During a rebuild, that same URE takes out the entire array in a parity RAID system. So very different results from the same URE.

Yes, but shouldn't every RAID array be set up to do data scrubbing? If you set up mdadm, you for sure get data scrubbing by default.

If you scrub the data on a regular basis, it's unlikely that a URE will show up just when one SSD has failed. And as in all cases, backup is the final solution.
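For anyone who wants to see what a scrub looks like on Linux md, here is a minimal Python sketch that talks to the array through sysfs. It assumes an array named `md0` (hypothetical) and uses the md files `sync_action` and `mismatch_cnt`. The helper names and the `sysfs_root` parameter are my own, added so the functions can be pointed at a test directory. How the periodic scrub gets scheduled depends on the distribution's packaging and is not shown.

```python
from pathlib import Path


def md_dir(array="md0", sysfs_root="/sys"):
    """Path to the md control directory of an array, e.g. /sys/block/md0/md."""
    return Path(sysfs_root) / "block" / array / "md"


def scrub_status(array="md0", sysfs_root="/sys"):
    """Return (current sync action, mismatch count recorded by the last check)."""
    md = md_dir(array, sysfs_root)
    action = (md / "sync_action").read_text().strip()
    mismatches = int((md / "mismatch_cnt").read_text().strip())
    return action, mismatches


def start_scrub(array="md0", sysfs_root="/sys"):
    """Ask md to start a consistency check (needs root on a real system)."""
    (md_dir(array, sysfs_root) / "sync_action").write_text("check\n")
```

On a real host, `scrub_status("md0")` would typically report `idle` between scrubs and `check` while one is running.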

Donahue

@scottalanmiller said in Large or small Raid 5 with SSD:

@Donahue said in Large or small Raid 5 with SSD:

I know you said earlier that with raid 5, you may as well add that 5th drive to the array and make it a raid 6 as opposed to sitting on the shelf.

Not "might as well", but "had better make sure you do." The difference in risk is astronomical. If you are even thinking a hot spare is an option, we've not explained adequately how it works.

I was thinking cold spare, not hot spare. I don't want the array rebuilding automatically before I have time to make a conscious decision to do it. But the difference is similar: I would still have a spare that is not helping the array at all just sitting on the shelf.

1337

Also, read operations have zero impact on SSDs, unlike spinning hard drives.

Donahue

I know the analogy is not perfect, but in my head I am thinking of the spare disk as a spare tire on a car. having a cold spare on the shelf to me is like having the spare tire mounted to the back or underneath the car, not being actively used to help the car stay on the road. So my instinct is to make sure I've got a spare. In the case of a 4 drive raid 5, that means a 5th disk. But as you say, IF I have that disk anyways, it is better, and as you say, emphatically so, to actually use that disk in the array from the beginning and have a 5 disk raid 6 and no spare. But that leads me back to my original position of not having a spare which my animal brain intuitively thinks of as bad and that I should get a spare. I know that my assumptions and instincts are wrong here, because I do not fully understand the scope of the difference in risks between the 4 drive raid 5 and the 5 drive raid 6. That is why I am asking all these questions, so that I can more fully understand my options and evaluate my choices based on empirical data or good logic, and not on instinct or intuition.

Donahue

I am still thinking of the problem as being one of linear risk and safety, not logarithmic, and that is my fundamental flaw I think.

Dashrender

@scottalanmiller said in Large or small Raid 5 with SSD:

@travisdh1 said in Large or small Raid 5 with SSD:

Normal operation of the RAID would correct the issue. Degraded behavior depends on the RAID level, e.g. RAID 6 in degraded mode should function like a RAID 5, so a URE doesn't become a problem until the second drive fails.

To be clear, a URE during normal degraded operations does impact one file, but not the array. From the point of view of the array, nothing is wrong. During a rebuild, that same URE takes out the entire array in a parity RAID system. So very different results from the same URE.

AWWWW - this is what I was missing. OK a normal read operation will only break one file. Thanks. that explains a lot!

Dashrender

@Donahue said in Large or small Raid 5 with SSD:

I know the analogy is not perfect, but in my head I am thinking of the spare disk as a spare tire on a car. having a cold spare on the shelf to me is like having the spare tire mounted to the back or underneath the car, not being actively used to help the car stay on the road. So my instinct is to make sure I've got a spare. In the case of a 4 drive raid 5, that means a 5th disk. But as you say, IF I have that disk anyways, it is better, and as you say, emphatically so, to actually use that disk in the array from the beginning and have a 5 disk raid 6 and no spare. But that leads me back to my original position of not having a spare which my animal brain intuitively thinks of as bad and that I should get a spare. I know that my assumptions and instincts are wrong here, because I do not fully understand the scope of the difference in risks between the 4 drive raid 5 and the 5 drive raid 6. That is why I am asking all these questions, so that I can more fully understand my options and evaluate my choices based on empirical data or good logic, and not on instinct or intuition.

In the case of the cold spare with RAID 5, if you lose one drive, you're now at risk of a second drive failing. That spare drive is doing you zero benefit until the rebuild process is 100% complete, and that is AFTER you start the process.

With RAID 6, you are protected from a second drive failure situation entirely. Now you order a replacement drive, and assuming no more failures, you stayed as safe as possible during the entire endeavor. BUT, if you lose a second drive during the process, you saved yourself the hassle of restoring because of RAID 6.

This all mostly only matters because you've 'decided' the expense of having the 'spare/extra' drive onsite already was worth it. If you determined that the spare wasn't worth having onsite, then back to RAID 5 you go.
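To put the two layouts side by side, here is a small Python sketch of the arithmetic. The 1 TB drive size and the `raid_summary` helper are assumptions for illustration only; it counts capacity and tolerated failures and says nothing about rebuild behavior.

```python
# Number of drives' worth of capacity spent on parity per RAID level.
PARITY_DRIVES = {5: 1, 6: 2}


def raid_summary(level, drives, drive_tb):
    """Usable capacity and tolerated drive failures for a parity RAID array."""
    parity = PARITY_DRIVES[level]
    if drives < parity + 2:
        raise ValueError(f"RAID {level} needs at least {parity + 2} drives")
    return {
        "usable_tb": (drives - parity) * drive_tb,
        "failures_tolerated": parity,
    }


# Four drives in RAID 5 with a fifth on the shelf, versus all five in RAID 6.
raid5 = raid_summary(level=5, drives=4, drive_tb=1.0)
raid6 = raid_summary(level=6, drives=5, drive_tb=1.0)
print("RAID 5, 4 drives + cold spare:", raid5)
print("RAID 6, 5 drives, no spare:   ", raid6)
```

Both layouts use five purchased drives and give the same usable space; the only difference the sketch shows is that the RAID 6 array survives a second failure before any rebuild has finished.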

Donahue

@Dashrender said in Large or small Raid 5 with SSD:

@Donahue said in Large or small Raid 5 with SSD:

I know the analogy is not perfect, but in my head I am thinking of the spare disk as a spare tire on a car. having a cold spare on the shelf to me is like having the spare tire mounted to the back or underneath the car, not being actively used to help the car stay on the road. So my instinct is to make sure I've got a spare. In the case of a 4 drive raid 5, that means a 5th disk. But as you say, IF I have that disk anyways, it is better, and as you say, emphatically so, to actually use that disk in the array from the beginning and have a 5 disk raid 6 and no spare. But that leads me back to my original position of not having a spare which my animal brain intuitively thinks of as bad and that I should get a spare. I know that my assumptions and instincts are wrong here, because I do not fully understand the scope of the difference in risks between the 4 drive raid 5 and the 5 drive raid 6. That is why I am asking all these questions, so that I can more fully understand my options and evaluate my choices based on empirical data or good logic, and not on instinct or intuition.

In the case of the cold spare with RAID 5, if you lose one drive, you're now at risk of a second drive failing. That spare drive is doing you zero benefit until the rebuild process is 100% complete, and that is AFTER you start the process.

With RAID 6, you are protected from a second drive failure situation entirely. Now you order a replacement drive, and assuming no more failures, you stayed as safe as possible during the entire endeavor. BUT, if you lose a second drive during the process, you saved yourself the hassle of restoring because of RAID 6.

This all mostly only matters because you've 'decided' the expense of having the 'spare/extra' drive onsite already was worth it. If you determined that the spare wasn't worth having onsite, then back to RAID 5 you go.

I agree, and that is where I am at now. I need to decide if it is worth having that 5th disk.

scottalanmiller

@Pete-S said in Large or small Raid 5 with SSD:

Yes, but shouldn't every RAID array be set up to do data scrubbing? If you set up mdadm, you for sure get data scrubbing by default.

If you scrub the data on a regular basis, it's unlikely that a URE will show up just when one SSD has failed. And as in all cases, backup is the final solution.

Data scrubbing is assumed in the URE stats. Running scrubs doesn't lower the risk below the stated figure; it only lowers it relative to the bad practice of never scrubbing, which the figures ignore. When we state URE risks, that's assuming immediately after a scrub.

scottalanmiller

@Donahue said in Large or small Raid 5 with SSD:

@scottalanmiller said in Large or small Raid 5 with SSD:

@Donahue said in Large or small Raid 5 with SSD:

I know you said earlier that with raid 5, you may as well add that 5th drive to the array and make it a raid 6 as opposed to sitting on the shelf.

Not "might as well", but "had better make sure you do." The difference in risk is astronomical. If you are even thinking a hot spare is an option, we've not explained adequately how it works.

I was thinking cold spare, not hot spare. I don't want the array rebuilding automatically before I have time to make a conscious decision to do it. But the difference is similar: I would still have a spare that is not helping the array at all just sitting on the shelf.

This isn't a good idea. You should have an array stable enough that you want it rebuilt. If you have this fear, you need a safer array.

scottalanmiller

@Donahue said in Large or small Raid 5 with SSD:

I am still thinking of the problem as being one of linear risk and safety, not logarithmic, and that is my fundamental flaw I think.

It's neither. It's more complex than that because it is "if this, then this risk" and compounded. It's not a smooth line at all, not even a logarithmic one.

scottalanmiller

@Dashrender said in Large or small Raid 5 with SSD:

@Donahue said in Large or small Raid 5 with SSD:

I know the analogy is not perfect, but in my head I am thinking of the spare disk as a spare tire on a car. having a cold spare on the shelf to me is like having the spare tire mounted to the back or underneath the car, not being actively used to help the car stay on the road. So my instinct is to make sure I've got a spare. In the case of a 4 drive raid 5, that means a 5th disk. But as you say, IF I have that disk anyways, it is better, and as you say, emphatically so, to actually use that disk in the array from the beginning and have a 5 disk raid 6 and no spare. But that leads me back to my original position of not having a spare which my animal brain intuitively thinks of as bad and that I should get a spare. I know that my assumptions and instincts are wrong here, because I do not fully understand the scope of the difference in risks between the 4 drive raid 5 and the 5 drive raid 6. That is why I am asking all these questions, so that I can more fully understand my options and evaluate my choices based on empirical data or good logic, and not on instinct or intuition.

In the case of the cold spare with RAID 5, if you lose one drive, you're now at risk of a second drive failing. That spare drive is doing you zero benefit until the rebuild process is 100% complete, and that is AFTER you start the process.

With RAID 6, you are protected from a second drive failure situation entirely. Now you order a replacement drive, and assuming no more failures, you stayed as safe as possible during the entire endeavor. BUT, if you lose a second drive during the process, you saved yourself the hassle of restoring because of RAID 6.

This all mostly only matters because you've 'decided' the expense of having the 'spare/extra' drive onsite already was worth it. If you determined that the spare wasn't worth having onsite, then back to RAID 5 you go.

Exactly, the difference in protection is unbelievable.

scottalanmiller

@Dashrender said in Large or small Raid 5 with SSD:

@scottalanmiller said in Large or small Raid 5 with SSD:

@travisdh1 said in Large or small Raid 5 with SSD:

Normal operation of the RAID would correct the issue. Degraded behavior depends on the RAID level, e.g. RAID 6 in degraded mode should function like a RAID 5, so a URE doesn't become a problem until the second drive fails.

To be clear, a URE during normal degraded operations does impact one file, but not the array. From the point of view of the array, nothing is wrong. During a rebuild, that same URE takes out the entire array in a parity RAID system. So very different results from the same URE.

AWWWW - this is what I was missing. OK a normal read operation will only break one file. Thanks. that explains a lot!

Correct. And often it's a small file that no one cares about or might even be in "empty space" and truly doesn't matter.

At the filesystem level, the URE risk applies only to the amount of stored data that matters, which is normally tiny compared to the size of the full array.

E.g. an 8TB array might hold 4.5TB of data, of which only 2TB is ever needed again. The risk is in a 2TB domain, rather than an 8TB domain. And IF it hits in that space, it is isolated to the one file impacted. So the mitigation is extreme.

You hit UREs on your desktop all of the time, and it almost never matters.
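The 8TB versus 2TB example can be sketched in Python as follows. The URE rate of one per 1e14 bits is purely an assumption (a figure often seen on spinning-drive spec sheets, and as noted in this thread SSDs are expected to do much better), and the model treats every bit read as an independent trial. `p_at_least_one_ure` is a hypothetical helper, not anything from a RAID tool.

```python
import math


def p_at_least_one_ure(tb_read, ure_per_bit=1e-14):
    """Chance of hitting at least one URE while reading `tb_read` terabytes once.

    Assumes independent bit errors at a constant rate `ure_per_bit`.
    """
    bits = tb_read * 1e12 * 8
    # 1 - (1 - p)^bits, computed in a numerically stable way for tiny p.
    return -math.expm1(bits * math.log1p(-ure_per_bit))


domains = [
    ("whole 8 TB array (rebuild)", 8.0),
    ("4.5 TB actually stored", 4.5),
    ("2 TB that is needed again", 2.0),
]
for label, tb in domains:
    print(f"{label}: {p_at_least_one_ure(tb):.1%}")
```

Passing a smaller `ure_per_bit` shows how quickly all three figures shrink for drives with a better rated error rate, while the ordering between the domains stays the same.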

Donahue

@scottalanmiller said in Large or small Raid 5 with SSD:

@Donahue said in Large or small Raid 5 with SSD:

@scottalanmiller said in Large or small Raid 5 with SSD:

@Donahue said in Large or small Raid 5 with SSD:

I know you said earlier that with raid 5, you may as well add that 5th drive to the array and make it a raid 6 as opposed to sitting on the shelf.

Not "might as well", but "had better make sure you do." The difference in risk is astronomical. If you are even thinking a hot spare is an option, we've not explained adequately how it works.

I was thinking cold spare, not hot spare. I don't want the array rebuilding automatically before I have time to make a conscious decision to do it. But the difference is similar: I would still have a spare that is not helping the array at all just sitting on the shelf.

This isn't a good idea. You should have an array stable enough that you want it rebuilt. If you have this fear, you need a safer array.

Having never personally used a RAID 5, all I have to go on is information that is presented online through mediums like ML. Some, perhaps even most, of the information I find is either out of date or pertains to the use of RAID 5 with spinners. I know that in the last 4 years I have had two or three spinners fail in RAID 10 arrays, and a few single drives fail in desktops, both spinners and SSDs. So in my mind, a drive failure is a reasonable thing to expect in the next 5 years. But we have also never had drives with warranties, so that changes the cost equation too.

I am not sure that my fear is rational, because my understanding of the actual risk is limited.

scottalanmiller

@Donahue said in Large or small Raid 5 with SSD:

@scottalanmiller said in Large or small Raid 5 with SSD:

@Donahue said in Large or small Raid 5 with SSD:

@scottalanmiller said in Large or small Raid 5 with SSD:

@Donahue said in Large or small Raid 5 with SSD:

I know you said earlier that with raid 5, you may as well add that 5th drive to the array and make it a raid 6 as opposed to sitting on the shelf.

Not "might as well", but "had better make sure you do." The difference in risk is astronomical. If you are even thinking a hot spare is an option, we've not explained adequately how it works.

I was thinking cold spare, not hot spare. I don't want the array rebuilding automatically before I have time to make a conscious decision to do it. But the difference is similar: I would still have a spare that is not helping the array at all just sitting on the shelf.

This isn't a good idea. You should have an array stable enough that you want it rebuilt. If you have this fear, you need a safer array.

Having never personally used a RAID 5, all I have to go on is information that is presented online through mediums like ML. Some, perhaps even most, of the information I find is either out of date or pertains to the use of RAID 5 with spinners. I know that in the last 4 years I have had two or three spinners fail in RAID 10 arrays, and a few single drives fail in desktops, both spinners and SSDs. So in my mind, a drive failure is a reasonable thing to expect in the next 5 years. But we have also never had drives with warranties, so that changes the cost equation too.

I am not sure that my fear is rational, because my understanding of the actual risk is limited.

The MORE you fear a drive failure, the MORE you would fear not rebuilding instantly, automatically. Your fear does not match your response.

scottalanmiller

That a drive might fail is not in question. In five years, there is a good chance of a drive failing.

What you need to do is apply that to your thinking and say "If I fear drives failing, what protects me from that?"

Donahue

Am I wrong to think that the probability of two drives failing is much less than the probability of just one drive failing? And while, say, a 24-48 hour decision window plus rebuild time is a lot more exposure than an instant rebuild, isn't it still quite low?
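One way to frame that question is a back-of-the-envelope Python sketch. Everything numeric in it is an assumption for illustration: a 2% annual failure rate per drive, a 6 hour rebuild, and drives that fail independently of each other. Real arrays often violate the independence assumption (same batch, same age, same wear), so treat the output as a lower bound on intuition rather than a measurement.

```python
HOURS_PER_YEAR = 24 * 365


def p_any_failure(drives, hours, afr):
    """Chance that at least one of `drives` independent drives fails within `hours`.

    `afr` is the assumed annual failure rate of a single drive (0.02 = 2%).
    """
    survive_one = (1.0 - afr) ** (hours / HOURS_PER_YEAR)
    return 1.0 - survive_one ** drives


AFR = 0.02          # assumed, not taken from any datasheet
REBUILD_HOURS = 6   # assumed rebuild time

# First failure: any of 4 drives over 5 years.
first = p_any_failure(4, 5 * HOURS_PER_YEAR, AFR)

# Second failure among the 3 survivors while the array is degraded.
instant = p_any_failure(3, REBUILD_HOURS, AFR)
delayed = p_any_failure(3, 48 + REBUILD_HOURS, AFR)

print(f"a first failure within 5 years:      {first:.1%}")
print(f"second failure, immediate rebuild:   {instant:.4%}")
print(f"second failure, 48 h wait + rebuild: {delayed:.4%}")
```

Under these assumptions the degraded-window risk is small in absolute terms either way, but the exposure grows roughly in proportion to how long the array sits degraded before the rebuild finishes.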