Responding to "This BS called URE" from Synology Forums

scottalanmiller

A poster claiming to know something about math and computers started a thread on the Synology forums seven years ago. @Dashrender found it today, and it seems it never got any great responses. It is bad to leave this kind of misinformation out there, and doing these analyses is always good, so let's break it down. (Since moving from SW to ML, the amount of "correcting dumbassery" and people trying to mislead others has all but disappeared, so we don't get to do this much.)

https://community.synology.com/enu/forum/17/post/79816

Here is the OP by "Roadkill401"...

It is really unfortunate when a bunch of tech people that know not much about statistics start to go on about how you are going to loose your SHR raid in rebuild because of URE.

What they are failing to actually get is that the statistic given is not a simple addition together to get 1.

Think of it this way. Everyone has heard of Russian Roulette, a single bullet in a chamber and you spin the cylinder around and it will randomly stop somewhere. Now there is 1 bullet and 5 additional empty chambers. so you have a 1 in 6 chance in pulling the trigger that the gun will go off. Now if you re-spin the cylinder and pull the trigger you chances are still 1 in 6. You could do this 7, 8, 9+ times and never get a bullet, as each and every time you spin, you reset the statistic,

Now apply this URE to your hard disk. They say that this magical 10^14 works out to 11.3 TB of information. So on a single 4tb hard drive, you would simply need to fill that drive and read the info back off it 4 times and it will give you an error. So it should be pretty simple to write a pattern to a disk and then read it back 4 times and you get your URE. You don't even need a raid to be able to do that test.. But funny, that doesn't happen. in day to day operation, you probably read well over 11.3tb of data off your hard dive in a year and we don't seem to have this mass drive failure or bad CRC errors on our files. A single disk has ZERO redundancy and so it can't simply calculate the missing bit of data from a URE.

So what does this URE really mean? if you read a single sector on a disk 10^14 times, statistically you the vast number of disks will start to fail. Now its a statistic, that means you are dealing with a bell curve that will have some disks failing well before that point, and some disks failing well after that point, but the vast majority of disks given a sizeable number of tests will start to fail around that point.

But your disk has billions upon billions of sectors, and each and every one of them has it's own URE. so that is why your disk does not fail and seems to keep on working day after day.

So then you get to your RAID scenario. There you have multiple disks, each with their own URE. And again you can't add a statistic up to give you a smaller number. For instance, if given the hypothetical statistic that you had a 1 in 10 chance of a single hard drive failing, that would not mean that if you put 10 drives into a RAID that one of those drives was defective. So how in any reasonable logic thinking way could Robin Harris get the idea that because a hard drive manufacturer publishes a URE of 10^14 that having 5 drives of 3tb would mean that in a rebuild you will fail? YOU CANNOT ADD STATISTICS TOGETHER!

Stop worrying about someone's misunderstanding of simple mathematics and just start using the devices that you have.

scottalanmiller

@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:

It is really unfortunate when a bunch of tech people that know not much about statistics start to go on about how you are going to loose your SHR raid in rebuild because of URE.

He starts off saying that people don't know statistics, but he doesn't say who. As a trained industrial engineer, I find the statistics in most URE discussions to be accurate. Maybe he's seen some really bad math that the rest of us haven't, where people add failure rates together (you can't do that in stats), but no URE discussion I've been in has done that. So he must be looking in very different places. His lack of references makes this hard; we just have to accept that he is attempting to correct people we have not encountered who don't know math.

scottalanmiller

@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:

Think of it this way. Everyone has heard of Russian Roulette, a single bullet in a chamber and you spin the cylinder around and it will randomly stop somewhere. Now there is 1 bullet and 5 additional empty chambers. so you have a 1 in 6 chance in pulling the trigger that the gun will go off. Now if you re-spin the cylinder and pull the trigger you chances are still 1 in 6. You could do this 7, 8, 9+ times and never get a bullet, as each and every time you spin, you reset the statistic,

Right. But the overall risk does not remain one in six. If you play Russian Roulette with the 1 in 6 chamber setup six times, the chance that the gun fires at least once is 1 - (5/6)^6, or about 66.5%. That's a dangerous number, that is "assumed death". Sure, you can survive, but it isn't likely. Someone is likely going to die.
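For anyone who wants to check the 66.5% figure, here is a minimal sketch of the "re-spin every time" arithmetic. The function name is made up for illustration, and it assumes every pull is an independent 1-in-6 event.

```python
from fractions import Fraction


def chance_of_firing(pulls: int, chambers: int = 6) -> Fraction:
    """Chance of at least one shot in `pulls` pulls when the cylinder is re-spun each time."""
    survive_one_pull = Fraction(chambers - 1, chambers)
    # Surviving every pull means surviving each independent pull in turn.
    return 1 - survive_one_pull ** pulls


if __name__ == "__main__":
    for pulls in (1, 3, 6, 12):
        print(f"{pulls:2d} pulls: {float(chance_of_firing(pulls)):.1%}")
```

The per-pull risk never changes, but the cumulative risk climbs with every pull, which is the part the analogy glosses over.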

However, in Russian roulette we have multiple players. We generally assume six. In our storage systems we have only one player and that same player puts the same gun to their own head every time. In most cases it is mathematically equivalent to having two or three bullets in the chambers, not just one. Using the RR example is a math trick to make it emotionally feel safer than it really is, and even with RR it should feel insanely dangerous.

Now, all of that still glosses over the fact that in traditional RR we reset the chamber after each pull, so that each player (or trigger pull) always has a 1/6 chance of firing no matter how many times it has happened previously. But that isn't how drive deterioration works.

Drive deterioration is assumed to be more like atomic deterioration, which is pretty predictable (only disk engineers really know this, if anyone does, and we don't know if they really do). We never know which atom is going to deteriorate, but atomic deterioration is so consistent in aggregate that we use half-life measurements as a way to tell time with extreme accuracy.

Because of this, UREs are far more like the other kind of Russian roulette, the one where we do NOT reset the chamber after each pull, but only at the start of the game. In that case, with six players there is a guarantee of failure by the end of the sixth pull, but by the fourth pull, it is really likely to happen. There is only a 1/6 chance that you'd make it to the final pull, and by the sixth one the firing event is guaranteed.
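A small sketch of the "single reset" version, for comparison with the re-spin version. The function names are hypothetical, and it assumes one round, six chambers, and a single spin at the start of the game.

```python
from fractions import Fraction


def fired_by_pull(pull: int, chambers: int = 6) -> Fraction:
    """Chance the round has fired on or before `pull` when the cylinder is spun only once."""
    if not 1 <= pull <= chambers:
        raise ValueError("pull must be between 1 and the number of chambers")
    # The round is equally likely to sit in any chamber, so the risk accumulates linearly.
    return Fraction(pull, chambers)


def reaches_final_pull(chambers: int = 6) -> Fraction:
    """Chance that nothing has fired before the last chamber comes up."""
    return 1 - fired_by_pull(chambers - 1, chambers)


if __name__ == "__main__":
    for pull in range(1, 7):
        print(f"by pull {pull}: {float(fired_by_pull(pull)):.1%}")
    print(f"reach the final pull: {float(reaches_final_pull()):.1%}")
```

Here the cumulative chance reaches two in three by the fourth pull and exactly one by the sixth, with a 1/6 chance of getting as far as the last chamber.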

So since the author wants to use RR as his example, the statistics make the disk failure a guarantee, not a risk. It's not quite this bad, obviously, but it is closer to this than it is to anything else. A URE is going to happen, and it is going to happen with a certain pre-determined frequency, but there is some amount of wiggle in the system: more than in "single reset" RR, but not nearly as much as in traditional RR.

scottalanmiller

@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:

Now apply this URE to your hard disk. They say that this magical 10^14 works out to 11.3 TB of information. So on a single 4tb hard drive, you would simply need to fill that drive and read the info back off it 4 times and it will give you an error.

If using single reset RR, yes. And tests of hard drives bear this out. So the math works in real world testing.
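A rough sketch of the arithmetic behind the "11.3 TB" figure and the four-pass read test. It assumes the spec-sheet rating of one URE per 10^14 bits read, and for the probability it treats every bit read as independent (the same model used in the rebuild calculations later in this thread). The names are made up for illustration.

```python
import math

URE_BITS = 10**14  # assumed rating: one unrecoverable read error per 1e14 bits read


def bytes_per_ure() -> float:
    """Bytes read, on average, per URE at the assumed rating."""
    return URE_BITS / 8


def expected_ures(bytes_read: float) -> float:
    """Average number of UREs for a given amount of data read."""
    return bytes_read * 8 / URE_BITS


def chance_of_at_least_one(bytes_read: float) -> float:
    """Chance of one or more UREs, treating each bit read as an independent trial."""
    bits = bytes_read * 8
    return -math.expm1(bits * math.log1p(-1 / URE_BITS))


if __name__ == "__main__":
    print(f"{bytes_per_ure() / 1e12:.2f} TB (decimal) per URE")
    print(f"{bytes_per_ure() / 2**40:.2f} TiB (binary) per URE")
    four_passes = 4 * 4e12  # a 4 TB drive read back four times
    print(f"expected UREs over four passes: {expected_ures(four_passes):.2f}")
    print(f"chance of at least one: {chance_of_at_least_one(four_passes):.1%}")
```

So 10^14 bits is 12.5 decimal terabytes, or roughly 11.37 TiB, which is where the OP's 11.3 number comes from. Four full passes of a 4 TB drive read 16 TB, more than one full "quota" of bits. Under the single reset view an error is due by then; under the independent-bit model the same reads work out to a bit over 70%.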

scottalanmiller

@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:

You don't even need a raid to be able to do that test.. But funny, that doesn't happen.

Actually, you can't test with RAID, because the RAID system protects against UREs essentially completely. RAID only fails from a URE when a URE on one of the drives in your array coincides, at the exact same moment and on the exact same bit (parity or mirror), with UREs on enough other drives to exhaust the array's redundancy. On RAID 5 that would mean on two drives, on RAID 6 on three, on RAID 7 on four and on RAID 1 on as many drives as you have in your array (which is often two, but can be any number that you like.) So while a single URE is basically guaranteed to happen, and often, two matching and simultaneous UREs, even on RAID 5, are so unlikely that one would not be expected in the entire history of humanity. But in theory, it could happen.

Then he says "funny, that doesn't happen" as if he's never used a computer. People with hard drives without RAID see this constantly! UREs are the most common cause of corrupted files, which can do terrible things to your computer or just cause that little speck in an image file that you don't always notice. Audio, video, and other creative professionals are used to looking for these. Office workers are familiar with files corrupting. IT is called in all of the time to repair computers that have had system files get corrupted. Saying "funny, that doesn't happen" is a weird way to phrase "as anyone who uses a computer knows, this happens so often that we all experience it and see it as a normal part of computing."

Dashrender

@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:

@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:

Now apply this URE to your hard disk. They say that this magical 10^14 works out to 11.3 TB of information. So on a single 4tb hard drive, you would simply need to fill that drive and read the info back off it 4 times and it will give you an error.

If using single reset RR, yes. And tests of hard drives bear this out. So the math works in real world testing.

I want to confirm - you've seen situations where a 4 TB drive has been filled, then read back 4 times and it fails - regularly?

scottalanmiller

@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:

So what does this URE really mean? if you read a single sector on a disk 10^14 times, statistically you the vast number of disks will start to fail. Now its a statistic, that means you are dealing with a bell curve that will have some disks failing well before that point, and some disks failing well after that point, but the vast majority of disks given a sizeable number of tests will start to fail around that point.

He claims a bell curve. But atomic deterioration does not really have a traditional bell curve. It's far more predictable than that implies. But even if we get a more or less traditional bell, that doesn't change anything statistically. That's all part of the math we are already using.

scottalanmiller

@dashrender said in Responding to "This BS called URE" from Synology Forums:

@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:

@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:

Now apply this URE to your hard disk. They say that this magical 10^14 works out to 11.3 TB of information. So on a single 4tb hard drive, you would simply need to fill that drive and read the info back off it 4 times and it will give you an error.

If using single reset RR, yes. And tests of hard drives bear this out. So the math works in real world testing.

I want to confirm - you've seen situations where a 4 TB drive has been filled, then read back 4 times and it fails - regularly?

Absolutely, everyone has. It's so common even non-IT people are used to it.

scottalanmiller

@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:

But your disk has billions upon billions of sectors, and each and every one of them has it's own URE. so that is why your disk does not fail and seems to keep on working day after day.

Right, disk failures are by each sector, and that's all. Hence why people just live with them and don't bother protecting against them in most cases. A single sector failure is pretty low risk to a normal computer user. This is all exactly as all URE discussions have said. He's not revealing something new, just pointing out the obvious. The drive doesn't fail, one sector gets a URE out of billions that get read.

The idea that a disk would fail from a URE is a weird injection that he has added here to make people think that the unknown other party is insane. But no URE discussion anywhere assumes that a disk will fail. That's why disk failure and URE are two different failure conditions entirely.

scottalanmiller

@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:

So then you get to your RAID scenario. There you have multiple disks, each with their own URE. And again you can't add a statistic up to give you a smaller number. For instance, if given the hypothetical statistic that you had a 1 in 10 chance of a single hard drive failing, that would not mean that if you put 10 drives into a RAID that one of those drives was defective.

This goes totally off of the rails. No one is discussing hard drives failing or defects. UREs are not defects. Hitting a URE is not a drive failure. The URE rate is the rate at which perfectly good, healthy drives have sector level errors from which recovery is not possible. Hitting a URE is part of drive usage, it is not a defect. It IS an error, but storage error rates are a rate, not a failure of the overall product. It would be like saying your spark plug misfires every one billion cycles, but only once and keeps going. That does not mean your engine has failed or that the spark plug is defective. There is just a 1/x chance that it won't fire that one time. That's just how all things work.

Now if you have a 1/10 chance of a hard drive currently having a URE on it (it doesn't work that way at all, UREs aren't an on-disk artefact, but let's just humor him) and you put ten of those drives into an array, then you should expect, more likely than not, that there is a URE lurking actively somewhere in that array: 1 - (9/10)^10 is about 65%. Obviously. This isn't hard math at that point. He seems really lost here.
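To put a number on the ten-drive hypothetical, here is a minimal sketch that assumes each drive independently has the OP's made-up 1-in-10 chance. The function names are illustrative only.

```python
def chance_any_affected(p_each: float, drives: int) -> float:
    """Chance that at least one of `drives` independent drives is affected."""
    return 1 - (1 - p_each) ** drives


def expected_affected(p_each: float, drives: int) -> float:
    """Average number of affected drives; this is the only thing simple addition gives you."""
    return p_each * drives


if __name__ == "__main__":
    print(f"one drive:  {chance_any_affected(0.1, 1):.1%}")
    print(f"ten drives: {chance_any_affected(0.1, 10):.1%}")
    print(f"expected affected drives out of ten: {expected_affected(0.1, 10):.1f}")
```

Adding the rates gives the expected count (one drive out of ten on average), not a certainty. Compounding them gives the chance that at least one drive is affected, which comes out to roughly 65%, far above the 10% for a single drive.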

scottalanmiller

@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:

So how in any reasonable logic thinking way could Robin Harris get the idea that because a hard drive manufacturer publishes a URE of 10^14 that having 5 drives of 3tb would mean that in a rebuild you will fail? YOU CANNOT ADD STATISTICS TOGETHER!

He repeats "you cannot add statistics together" in the hope that we will think that someone did. But no one did; he's playing to the reader's emotions.

Robin Harris has covered this extensively, and all of Robin's numbers give a chance of failure, exactly the thing that the OP here is alluding to wanting. He just pretends that he didn't get it. Either he's just trying to be a jerk and trick people, or he doesn't see statistical failure as a chance but as a sure thing. Robin always presents the math as "and this is the chance that you will hit the error", always. All of Robin's papers with this math are linked in the MangoLassi RAID link list for reference.

So the real question is, how can you with any reasonable logic, not see how likely the URE is to be hit with the incredibly large drive sizes that we have today?
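As a rough illustration of how quickly the chance grows with capacity, here is a sketch under stated assumptions: every surviving drive must be read in full during the rebuild, each bit read is an independent trial, and the drive is rated at one URE per 10^14 bits unless another rating is passed in. The function name and the example array sizes are hypothetical.

```python
import math


def rebuild_ure_chance(surviving_drives: int, drive_tb: float, ure_bits: float = 1e14) -> float:
    """Chance of at least one URE while reading every surviving drive in full."""
    bits_to_read = surviving_drives * drive_tb * 1e12 * 8
    # log1p/expm1 keep precision when the per-bit probability is tiny.
    return -math.expm1(bits_to_read * math.log1p(-1 / ure_bits))


if __name__ == "__main__":
    # Five 3 TB drives in RAID 5: four survivors are read during the rebuild.
    print(f"5 x 3 TB, 1e14 rating: {rebuild_ure_chance(4, 3):.1%}")
    print(f"5 x 3 TB, 1e15 rating: {rebuild_ure_chance(4, 3, ure_bits=1e15):.1%}")
    for size_tb in (1, 2, 4, 8):
        print(f"4 survivors x {size_tb} TB: {rebuild_ure_chance(4, size_tb):.1%}")
```

With the five 3 TB drives from the OP's example this gives a chance in the low 60s percent, not a certainty and not something to wave away either.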

scottalanmiller

@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:

Stop worrying about someone's misunderstanding of simple mathematics and just start using the devices that you have.

He ends with "don't worry about being IT professionals, don't bother to protect your data, just trust your vendors to magically provide protection that you didn't ask for, pay for, nor did they claim to give."

It's super weird to say to trust the devices here, when the device makers are the ones warning us of the risks!

scottalanmiller

In the second response, Charles Hooper references my paper on the same:

Roadkill401,
Is this the issue that you are describing?
http://www.smbitjournal.com/2012/05/when-no-redundancy-is-more-reliable/

"What happens that scares us during a RAID 5 resilver operation is that an unrecoverable read error (URE) can occur. When it does the resilver operation halts and the array is left in a useless state – all data on the array is lost. On common SATA drives the rate of URE is 10^14, or once every twelve terabytes of read operations. That means that a six terabyte array being resilvered has a roughly fifty percent chance of hitting a URE and failing."

I have a degree in mathematics - but I have been focused on computer technology for roughly the last 20 years (so my mathematics skills are a bit rusty).

I believe that you are correct to a degree. In the above quoted example, there is not a roughly 50 percent chance of hitting a URE and having the array fail during the rebuild (resilver). Just as it is possible to roll a six sided die 10 times and never have the number six come up on top - it is a problem of probability, not straight addition and division. Also keep in mind that a drive's actual URE statistic does not remain constant through the life of the drive - the actual URE statistic decays as the drive ages.

Let's use a simple example that I have posted on the Synology forums before. Consider a four drive RAID 5 (SHR) array composed of 2TB drives. When one drive fails, that RAID 5 array has roughly 48,000,000,000,000 data bits that must be read successfully without a URE for the array to rebuild successfully when the failed drive is replaced. Using just the URE statistic provided by drive manufacturers, drives in this RAID 5 array with a one URE in 10^14 rating have a roughly 38.1% chance of failing to successfully rebuild when the failed drive is replaced. Here is the equation:
(1 - (99,999,999,999,999 / 100,000,000,000,000) ^ 48,000,000,000,000) = 0.380979164

As you stated, drives are read a sector at a time, not a bit at a time. Most drives are now offered with 4KB sector sizes, rather than the older 512 byte sector size, so the drives with a one URE in 10^14 bit rating actually have a one URE in 3,051,757,813 4KB sector rating. In the same four drive RAID 5 array, there are roughly 1,464,843,750 4KB sectors in the non-failed drives. Again there is a roughly 38.1% chance of failing to successfully rebuild when the failed drive is replaced. Here is the equation:
(1 - (3,051,757,812 / 3,051,757,813) ^ 1,464,843,750) = 0.381216604

For comparison, a four drive RAID 10 array composed of 2TB drives has a roughly 14.8% chance of failure during the rebuild:
(1 - (99,999,999,999,999 / 100,000,000,000,000) ^ 16,000,000,000,000) = 14.77%

For arrays with a larger number of drives the difference between RAID 10 and RAID 5 (SHR) becomes even more significant because only a single other drive in a RAID 10 array must be fully read error free, while in a RAID 5 array all other drives in the array must be fully read error free.
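The figures quoted above can be rechecked with a short sketch. It assumes the same model as the quoted post (each bit read is an independent trial at a 1 in 10^14 rate); the function names are made up for illustration. One detail worth knowing: 1 - 10^-14 cannot be stored exactly in double precision, so the naive formula drifts slightly, which appears to explain the small gap between the quoted 0.380979 and 0.381217.

```python
import math

P_BIT = 1e-14  # assumed per-bit chance of an unrecoverable read error


def naive_chance(bits: float) -> float:
    """Direct translation of the quoted formula; loses precision because 1 - 1e-14 is rounded."""
    return 1 - (1 - P_BIT) ** bits


def stable_chance(bits: float) -> float:
    """Same quantity computed with log1p/expm1, which stay accurate for tiny probabilities."""
    return -math.expm1(bits * math.log1p(-P_BIT))


if __name__ == "__main__":
    raid5_bits = 48e12   # three surviving 2 TB drives
    raid10_bits = 16e12  # one surviving 2 TB mirror partner
    for label, bits in (("RAID 5", raid5_bits), ("RAID 10", raid10_bits)):
        print(f"{label}: naive {naive_chance(bits):.6f}, stable {stable_chance(bits):.6f}")
```

Either way the conclusion holds: about 38% for the four drive RAID 5 and a little under 15% for the RAID 10.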

IRJ

@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:

So this guy trying to claim he knows something about math and computers makes a thread on the Synology forums seven years ago and @Dashrender found it today and it seems like it never got any great responses and it is bad to leave this kind of misinformation out there and doing these analyses are always good, so let's break it down (since moving from SW to ML, the amount of "correcting dumbassery and people trying to mislead others has all but disappeared so we don't get to do this much.)

https://community.synology.com/enu/forum/17/post/79816

Why not reply to the original post on Synology? It doesn't make sense to address it here.

scottalanmiller

@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:

I believe that you are correct to a degree. In the above quoted example, there is not a roughly 50 percent chance of hitting a URE and having the array fail during the rebuild (resilver). Just as it is possible to roll a six sided die 10 times and never have the number six come up on top - it is a problem of probability, not straight addition and division.

So yes and no. Let's start with the die example. There are the statistics, and there is the "chance". There is a "chance" that you will roll a single die a billion times and never get a six. Yes. Obviously. We all know that. But statistically, you will get it pretty quickly.

Quick stats math...

(5/6)^6 = 15625/46656 ≈ 0.335 chance of NOT having it happen in six rolls, so

1 - 0.335 = 0.665, or about a 66.5% chance of hitting the "die URE" of a six.

That's right, it's not lower than 50% in the dice example, it's higher, a lot higher, and that is with only six rolls. With the ten rolls from the quoted example it climbs to about 84% (1 - (5/6)^10). Yes, there is still a chance, a decent one, that you won't hit it. But the chances are if you roll a die ten times that you will get a six. Very good chances.
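If the formula feels too abstract, the dice example can also be checked by brute force. This is a minimal sketch with a seeded simulation next to the exact value; the function names, trial count, and seed are arbitrary choices for illustration.

```python
import random


def exact_chance_of_six(rolls: int) -> float:
    """Exact chance of seeing at least one six in `rolls` fair rolls."""
    return 1 - (5 / 6) ** rolls


def simulated_chance_of_six(rolls: int, trials: int = 50_000, seed: int = 1) -> float:
    """Estimate the same chance by rolling a simulated die many times."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # One trial: roll the die `rolls` times and check whether a six ever shows up.
        if any(rng.randint(1, 6) == 6 for _ in range(rolls)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for rolls in (6, 10):
        exact = exact_chance_of_six(rolls)
        simulated = simulated_chance_of_six(rolls)
        print(f"{rolls} rolls: exact {exact:.1%}, simulated {simulated:.1%}")
```

The simulated numbers land close to the exact ones, which is the point: "it might not happen" and "it will probably happen" are both true at once.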

The "roughly 50%" number was based on statistics math. It just happens to be that at around the 50% chance mark additive numbers and statistical numbers are pretty close. They diverge as you leave the top of the bell in either direction, but they are pretty much on top of each other right in the middle. So because the writers here likely don't know statistical math, all they can do is see that additive math would have gotten us into the same ballpark.

scottalanmiller

@irj said in Responding to "This BS called URE" from Synology Forums:

@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:

So this guy trying to claim he knows something about math and computers makes a thread on the Synology forums seven years ago and @Dashrender found it today and it seems like it never got any great responses and it is bad to leave this kind of misinformation out there and doing these analyses are always good, so let's break it down (since moving from SW to ML, the amount of "correcting dumbassery and people trying to mislead others has all but disappeared so we don't get to do this much.)

https://community.synology.com/enu/forum/17/post/79816

Why not reply to original post on synology. It doesn't make sense to address it here

I tried, they don't let you. Since they codified it in their archives, I wanted to make sure it was addressed somewhere, at least.

scottalanmiller

@irj said in Responding to "This BS called URE" from Synology Forums:

@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:

So this guy trying to claim he knows something about math and computers makes a thread on the Synology forums seven years ago and @Dashrender found it today and it seems like it never got any great responses and it is bad to leave this kind of misinformation out there and doing these analyses are always good, so let's break it down (since moving from SW to ML, the amount of "correcting dumbassery and people trying to mislead others has all but disappeared so we don't get to do this much.)

https://community.synology.com/enu/forum/17/post/79816

Why not reply to original post on synology. It doesn't make sense to address it here

Trying again. I tried to put all this there but the "comment" buttons didn't do anything. Maybe they were having an issue. Let's see...

scottalanmiller

No luck. Even when signed in, the "comment" and "reply" fields appear to be disabled, which makes sense since this is their legacy forum.

scottalanmiller

@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:

As you stated, drives are read a sector at a time, not a bit at a time. Most drives are now offered with 4KB sector sizes, rather than the older 512 byte sector size, so the drives with a one URE in 10^14 bit rating actually have a one URE in 3,051,757,813 4KB sector rating. In the same four drive RAID 5 array, there are roughly 1,464,843,750 4KB sectors in the non-failed drives. Again there is a roughly 38.1% chance of failing to successfully rebuild when the failed drive is replaced. Here is the equation:
(1 - (3,051,757,812 / 3,051,757,813) ^ 1,464,843,750) = 0.381216604

This is a little confusing. While bigger sectors do mean bigger potential failures, they do not change the failure rate overall because URE is measured in bit reads, not sector reads.
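A quick sketch of why the sector size washes out. It assumes the 10^14 bit rating, independent reads, and a per-sector chance equal to the sector's bit count divided by the rating (the same conversion used in the quoted post). The function names are hypothetical.

```python
import math

URE_BITS = 1e14  # assumed rating: one URE per 1e14 bits read


def chance_counting_bits(total_bits: float) -> float:
    """Chance of at least one URE when each bit read is treated as a trial."""
    return -math.expm1(total_bits * math.log1p(-1 / URE_BITS))


def chance_counting_sectors(total_bits: float, sector_bytes: int) -> float:
    """Same chance when each sector read is treated as a trial."""
    sector_bits = sector_bytes * 8
    sectors = total_bits / sector_bits
    per_sector = sector_bits / URE_BITS  # assumption: the rating scaled up to one sector
    return -math.expm1(sectors * math.log1p(-per_sector))


if __name__ == "__main__":
    total_bits = 48e12  # the four drive RAID 5 example
    print(f"per bit:        {chance_counting_bits(total_bits):.6f}")
    print(f"512 B sectors:  {chance_counting_sectors(total_bits, 512):.6f}")
    print(f"4 KB sectors:   {chance_counting_sectors(total_bits, 4096):.6f}")
```

Fewer, larger trials with a proportionally larger per-trial chance give the same answer to several decimal places, so the 4KB versus 512 byte distinction changes how much data one error takes out, not how likely an error is.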

Dashrender

@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:

@dashrender said in Responding to "This BS called URE" from Synology Forums:

@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:

@scottalanmiller said in Responding to "This BS called URE" from Synology Forums:

Now apply this URE to your hard disk. They say that this magical 10^14 works out to 11.3 TB of information. So on a single 4tb hard drive, you would simply need to fill that drive and read the info back off it 4 times and it will give you an error.

If using single reset RR, yes. And tests of hard drives bear this out. So the math works in real world testing.

I want to confirm - you've seen situations where a 4 TB drive has been filled, then read back 4 times and it fails - regularly?

Absolutely, everyone has. It's so common even non-IT people are used to it.

I wasn't thinking.. of course, I probably have run into this on a single drive and didn't realize what the issue was - a single file failed.. not generally a huge deal.