Recovery Time Objectives - How can I come up with a real world number...
-
I shouldn't ask; they should have to review the document, perform the calculations, and come back with an answer. I'm just attempting to put something on paper.
Which maybe I shouldn't?
-
@scottalanmiller said in Recovery Time Objectives - How can I come up with a real world number...:
@coliver said in Recovery Time Objectives - How can I come up with a real world number...:
Isn't this a business thing? The business should mandate an RTO and IT should build a system that reflects that? Or am I missing something.
that's not at all how it should work, but often how it does.
So IT should be in charge of defining RTO. My assumption was that this is an accounting thing where you define how long you can afford to be down for and design a system that can recover by, or before, that time period.
-
@coliver said in Recovery Time Objectives - How can I come up with a real world number...:
@scottalanmiller said in Recovery Time Objectives - How can I come up with a real world number...:
@coliver said in Recovery Time Objectives - How can I come up with a real world number...:
Isn't this a business thing? The business should mandate an RTO and IT should build a system that reflects that? Or am I missing something.
that's not at all how it should work, but often how it does.
So IT should be in charge of defining RTO.
Yes. RTO is generated by a combination of the financial figures provided by the business and IT applying those numbers to technical realities. The business is incapable of providing a usable RTO, only IT can do that. The business does not know what RTO is possible at what cost.
-
@scottalanmiller said in Recovery Time Objectives - How can I come up with a real world number...:
@coliver said in Recovery Time Objectives - How can I come up with a real world number...:
@scottalanmiller said in Recovery Time Objectives - How can I come up with a real world number...:
@coliver said in Recovery Time Objectives - How can I come up with a real world number...:
Isn't this a business thing? The business should mandate an RTO and IT should build a system that reflects that? Or am I missing something.
that's not at all how it should work, but often how it does.
So IT should be in charge of defining RTO.
Yes. RTO is generated by a combination of the financial figures provided by the business and IT applying those numbers to technical realities. The business is incapable of providing a usable RTO, only IT can do that. The business does not know what RTO is possible at what cost.
Ok, so I was half wrong. Thanks for clearing that up.
-
@coliver said in Recovery Time Objectives - How can I come up with a real world number...:
My assumption was that this is an accounting thing where you define how long you can afford to be down for and design a system that can recover by, or before, that time period.
That's one of the most dangerous business myths around IT. That there are these "lines" to be drawn. Like "we can be down one hour, but not two." It's completely not reflective of real life. If you say that "you cannot be down for two hours", you imply that it is worth one penny short of the entire potential value of the business to protect against a two hour outage. Obviously, that's absurd. But that is what that statement tells the IT department.
All disaster prevention and recovery is based around the cost of protection. The more protection you want, the more it costs. How much it costs to be down and what the risk aversion is are business decisions. How that translates into a usable RPO/RTO is then defined by IT based on those numbers. Otherwise, totally insane things happen, like spending $100K to protect against a $5K outage.
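A minimal sketch of that last comparison, for anyone who wants to see the arithmetic. The $100K and $5K figures come from the post; the 20% yearly likelihood and the function name are assumptions invented for illustration.

```python
def protection_is_worth_it(protection_cost, outage_cost, yearly_probability):
    """Compare the yearly cost of protection with the loss it is expected to prevent."""
    expected_yearly_loss = outage_cost * yearly_probability
    return protection_cost <= expected_yearly_loss

# $100K of protection against a $5K outage (assumed 20% chance per year).
print(protection_is_worth_it(100_000, 5_000, 0.20))  # False
```

Even if the outage were certain to happen every year, the answer would not change, because the protection costs far more than the loss it prevents.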
-
In the real world, companies lose money by the hour. Any viable company can be down for hours or days; most can be down for weeks or months. Not that it wouldn't hurt, but they can be and still survive. The "we can't be down for more than X" idea makes no sense because it basically says "don't bother recovering faster than this because we aren't saying that there is any value" and then "don't bother trying to recover if you can't make this line because we will be out of business." No business loses nothing for a day, then suddenly goes out of business, taking all of its losses in one second.
-
I guess I mis-worded my original statement. Or didn't write it appropriately. I assumed that the cost of downtime vs the cost of a solution would be taken into account when defining the RTO. Although you've cleared it up significantly.
-
To answer the original question: How can I come up with a real world number...
You can't. Business systems are too complex to come up with a single figure. And disasters are always too unpredictable. The exercise is a bullshit marketing job to convince someone to spend some money.
IMHO
-
@Carnival-Boy said in Recovery Time Objectives - How can I come up with a real world number...:
To answer the original question: How can I come up with a real world number...
You can't. Business systems are too complex to come up with a single figure. And disasters are always too unpredictable. The exercise is a bullshit marketing job to convince someone to spend some money.
IMHO
I agree, it's not something that I think IT should be doing at all. You get numbers, you make a reasonable investment. You might have some guess as to recovery times which are useful for triage (like does it take one hour or ten hours to get systems back off of the tape) but RTO/RPO are just silly. In all my years I've never had an occasion to use them.
-
See I can agree with that @Carnival-Boy except that RTO in my mind should be the expected amount of time to recover from any of the possible scenarios out there.
E.g., restoring an individual file shouldn't take more than a few minutes.
I'm just trying to put some real world time down for some of the realistic events that might occur, but even that seems difficult.
-
@DustinB3403 said in Recovery Time Objectives - How can I come up with a real world number...:
See I can agree with that @Carnival-Boy except that RTO in my mind should be the expected amount of time to recover from any of the possible scenarios out there.
That's never predictable. What if the network fails? What if the medium fails? What if the server is under load? What if things have changed?
It's not a totally useless number, but it is mostly useless.
-
@DustinB3403 said in Recovery Time Objectives - How can I come up with a real world number...:
E.g., restoring an individual file shouldn't take more than a few minutes.
Even at a Fortune 10 bank, restores took anywhere from one minute to two days.
-
@DustinB3403 said in Recovery Time Objectives - How can I come up with a real world number...:
See I can agree with that @Carnival-Boy except that RTO in my mind should be the expected amount of time to recover from any of the possible scenarios out there.
E.g., restoring an individual file shouldn't take more than a few minutes.
I'm just trying to put some real world time down for some of the realistic events that might occur, but even that seems difficult.
So go cause some of these realistic events (with your own data, of course) and see how long it takes to recover from them... and then double it... If it takes you 15 minutes to recover that word document you just accidentally deleted on purpose, then I would suggest recording that it could take 30 minutes or an hour to restore a file for someone.
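A rough sketch of that "measure it, then double it" habit. Everything here is hypothetical: `fake_restore` is a stand-in for whatever your real restore procedure is, and the 2x padding is just the rule of thumb above, not a standard.

```python
import time

def padded_restore_estimate(restore, padding=2.0):
    """Time one trial restore and return (measured, padded) seconds."""
    start = time.perf_counter()
    restore()
    measured = time.perf_counter() - start
    return measured, measured * padding

def fake_restore():
    # Stand-in for a real restore job; replace with your own procedure.
    time.sleep(0.05)

measured, recorded = padded_restore_estimate(fake_restore)
print(f"measured {measured:.2f}s, record {recorded:.2f}s")
```

One trial under clean conditions is only a starting point; repeating it on different days and under load gives a more honest spread.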
-
So if RTO/RPO is mostly useless, how do you quantify the point at which a business outage becomes unacceptable?
-
@DustinB3403 said in Recovery Time Objectives - How can I come up with a real world number...:
I'm just trying to put some real world time down for some of the realistic events that might occur, but even that seems difficult.
Because it IS difficult. Let's ask the same thing in some other terms...
How long "should" a file transfer from point A to point B take? If you ask the business, they will tell you how long they want it to take. Ask IT and they will figure out how fast the wire can transfer it. Actually do it and you find out that the bottlenecks were not where you thought they were, that the system is not pristine while you are doing it, and that it takes an unpredictable amount of time. IT systems are complex and we can't accurately predict this stuff. We can guess, but the farther out and the less common the operation, the bigger the guess.
You can simulate some disasters and test some things. That's the best you can do, and it isn't very good.
-
@dafyre said in Recovery Time Objectives - How can I come up with a real world number...:
@DustinB3403 said in Recovery Time Objectives - How can I come up with a real world number...:
See I can agree with that @Carnival-Boy except that RTO in my mind should be the expected amount of time to recover from any of the possible scenarios out there.
E.g., restoring an individual file shouldn't take more than a few minutes.
I'm just trying to put some real world time down for some of the realistic events that might occur, but even that seems difficult.
So go cause some of these realistic events (with your own data, of course) and see how long it takes to recover from them... and then double it... If it takes you 15 minutes to recover that word document you just accidentally deleted on purpose, then I would suggest recording that it could take 30 minutes or an hour to restore a file for someone.
And do it while you are home, your car is out of gas, you aren't dressed, your phone battery has died, the server is down, the tape is buried under paperwork, you don't have good labels and the person asking doesn't know the name of the file.
-
@scottalanmiller said in Recovery Time Objectives - How can I come up with a real world number...:
@dafyre said in Recovery Time Objectives - How can I come up with a real world number...:
@DustinB3403 said in Recovery Time Objectives - How can I come up with a real world number...:
See I can agree with that @Carnival-Boy except that RTO in my mind should be the expected amount of time to recover from any of the possible scenarios out there.
E.g., restoring an individual file shouldn't take more than a few minutes.
I'm just trying to put some real world time down for some of the realistic events that might occur, but even that seems difficult.
So go cause some of these realistic events (with your own data, of course) and see how long it takes to recover from them... and then double it... If it takes you 15 minutes to recover that word document you just accidentally deleted on purpose, then I would suggest recording that it could take 30 minutes or an hour to restore a file for someone.
And do it while you are home, your car is out of gas, you aren't dressed, your phone battery has died, the server is down, the tape is buried under paperwork, you don't have good labels and the person asking doesn't know the name of the file.
That's pretty darn realistic right there.
-
@DustinB3403 said in Recovery Time Objectives - How can I come up with a real world number...:
So if RTO/RPO is mostly useless, how do you quantify the point at which a business outage becomes unacceptable?
That's the fundamental flaw. There is no such number and cannot be. That's the danger of the RTO concept, that someone might actually think that such a number exists.
-
@scottalanmiller said in Recovery Time Objectives - How can I come up with a real world number...:
@DustinB3403 said in Recovery Time Objectives - How can I come up with a real world number...:
So if RTO/RPO is mostly useless, how do you quantify the point at which a business outage becomes unacceptable?
That's the fundamental flaw. There is no such number and cannot be. That's the danger of the RTO concept, that someone might actually think that such a number exists.
Sorry my point is, how do you design a backup and recovery system if this is such a flawed goal? How do you define the recovery objective and systems to implement it?
-
@DustinB3403 said in Recovery Time Objectives - How can I come up with a real world number...:
Sorry my point is, how do you design a backup and recovery system if this is such a flawed goal? How do you define the recovery objective and systems to implement it?
It's about curves. Think calculus. You have a cost curve that shows how much it costs you (losses) to be down over time (remember this is complex because we might be talking about a file or a VM or the entire infrastructure.) What does a file recovery cost you? $20/day? Less, probably.
Then you have a curve of what it costs to recover at different time intervals. This tends to be a jagged curve because of tech leaps. Like jumping from GigE to 10GigE jumps the price but REALLY improves performance.
Then you compare the curves to see where the sweet spot is for the business, based on the likelihood of the event.
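Here is a small sketch of that comparison, assuming made-up numbers throughout. The option names, prices, recovery times, loss rate, and event frequency are all invented for illustration, and the loss curve is kept linear only to keep the example short; a real one would come from the business and would rarely be a straight line.

```python
# All figures below are invented for illustration.
EVENTS_PER_YEAR = 0.5  # assumed likelihood: one such outage every two years

def downtime_loss(hours):
    """Loss curve supplied by the business; linear here only for simplicity."""
    return 500 * hours

# (option name, yearly cost of the option, hours to recover with it)
# The jumps between rows are the "jagged" part of the recovery cost curve.
OPTIONS = [
    ("tape restore", 1_000, 48),
    ("disk backup, GigE", 4_000, 12),
    ("disk backup, 10GigE", 9_000, 4),
    ("hot standby", 40_000, 0.25),
]

def yearly_total(option):
    """Yearly cost of the option plus the expected yearly loss while recovering."""
    _, yearly_cost, recovery_hours = option
    return yearly_cost + downtime_loss(recovery_hours) * EVENTS_PER_YEAR

for option in OPTIONS:
    print(f"{option[0]:<22}{yearly_total(option):>10,.0f}")

best = min(OPTIONS, key=yearly_total)
print("sweet spot:", best[0])
```

With these particular numbers the cheapest total is the mid-range option, not the fastest or the cheapest one. Change the loss curve or the likelihood and the sweet spot moves, which is the whole point of comparing curves instead of drawing a line.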