Amazon S3 Outage shows the danger of doing things cheaply.
-
@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:
@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:
@scottalanmiller When a business person says "Was the server down" - they don't mean the physical box. They mean whatever that box provides to them, or the customers, or whatever that box does... The fact that it's not off, it's rebooting, doesn't matter - 'what it does' was down... for four hours.
Yes and no. Was the server down? No, it was in a greenzone. I speak to business people only about this stuff, and "offline in a down time" is never considered "down". And an application down to the system team is "not down." Is the server down? "Nope, go look into your app."
I realize that in the SMB people tend to look at one person for all of the questions from networking to systems to apps, but in the enterprise, if someone asks if the server is down the question is about a server, not an application. If a business person wants to know about X, they ask about X - competent business people don't start digging under the hood and asking the wrong questions to "sound cool." If the car doesn't work, a competent business person asks "is the car working" to their mechanic, they do not say "check if there is enough oil" when they actually want to know if the car works - clearly not a relevant question.
Part of work is extrapolating what people mean, even if they asked the question in the wrong way. "Is the car working" covers fuel, seats, keys, doors etc, the same as "Is the server down" covers it being in a reboot state... yes, what that server does is down.
-
@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:
If your business people start trying to get under the hood, whatever you do don't condescend and talk to them like they are children. Give them the facts, don't mislead them. If they ask if the server is down be clear - no, the server did not experience a failure, but there might be an issue with the application or some other part of the system, what issue are you asking about?
Playing the "management are children" game is dangerous because that's how we get dangerous situations like a manager asking "is the server down", getting the incorrect answer of "yes", then demanding that Dell be brought in to answer for something that has nothing to do with them and then randomly switching vendors on the next purchase because they were told that the server failed, when it did not. It makes them look foolish to staff, to vendors, undermines their vendor relationships and makes them impotent at business decisions because they think that they have info that they do not.
You can't control if management wants to take over IT, but don't try to abstract IT in a condescending way to enable this, it can't provide a good outcome.
If the server is being restarted, and the application has not been temporarily moved to another server, that server/service/application/whatever is down. If a tech said to me "Nah, the server is on, has power, and is just restarting so it's actually up, we just didn't bother to move your application to another node in the cluster first - but it's up." - they are wrong. It's down.
-
@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:
Part of work is extrapolating what people mean, even if they asked the question in the wrong way. "Is the car working" covers fuel, seats, keys, doors etc, the same as "Is the server down" covers it being in a reboot state... yes, what that server does is down.
But you have to keep them from believing that they were told that "the server was down" or bad information gets repeated. Of course you have to extrapolate, but you can always clarify. If you don't require people to speak accurately, you validate misinformation and you become part of the problem if you are not careful. If management wants under the hood information, that information needs to be correct.
-
@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:
If the server is being restarted, and the application has not been temporarily moved to another server, that server/service/application/whatever is down. If a tech said to me "Nah, the server is on, has power, and is just restarting so it's actually up, we just didn't bother to move your application to another node in the cluster first - but it's up." - they are wrong. It's down.
Not if it is during a greenzone, that would be misinformation. Is it down? yes, but it is supposed to be. Do you say that your car is "missing" when the police ask you, or do you admit that you sold it?
Like you said, you have to extrapolate and what they mean by "down" is that it is "down when it is supposed to be up", which is not true. It's down when it is supposed to be down.
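A rough sketch of that distinction, assuming you track outages and scheduled maintenance windows ("greenzones") as time ranges. The names `Window` and `unplanned_downtime` and the sample times are hypothetical, not taken from any monitoring product:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Window:
    """A time range: either an outage or a scheduled greenzone."""
    start: datetime
    end: datetime


def overlap(a: Window, b: Window) -> timedelta:
    """Length of time during which both windows are active."""
    start = max(a.start, b.start)
    end = min(a.end, b.end)
    return max(end - start, timedelta(0))


def unplanned_downtime(outage: Window, greenzones: list[Window]) -> timedelta:
    """Outage time NOT covered by a scheduled greenzone.

    Assumes the greenzones do not overlap each other.
    """
    planned = sum((overlap(outage, g) for g in greenzones), timedelta(0))
    return (outage.end - outage.start) - planned


# Hypothetical sample data: a two-hour greenzone and two outages.
greenzone = Window(datetime(2017, 3, 4, 2, 0), datetime(2017, 3, 4, 4, 0))
reboot = Window(datetime(2017, 3, 4, 2, 30), datetime(2017, 3, 4, 3, 0))
overrun = Window(datetime(2017, 3, 4, 3, 30), datetime(2017, 3, 4, 5, 0))
```

A reboot that sits entirely inside the greenzone contributes no unplanned downtime, while one that runs past the end of the window does. Only the second case is "down when it is supposed to be up".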
-
@scottalanmiller Agree. Totally. When asked "Is the server down." you can say "The server was rebooting as planned, as the application was not part of a cluster, the application was offline". But to say "No, the server was fine." is misleading.
-
@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:
Not if it is during a greenzone, that would be misinformation. Is it down? yes, but it is supposed to be. Do you say that your car is "missing" when the police ask you, or do you admit that you sold it?
Like you said, you have to extrapolate and what they mean by "down" is that it is "down when it is supposed to be up", which is not true. It's down when it is supposed to be down.
Amazon's servers were down, rebooting, restarting etc, when they were supposed to be up. "Is the server down" = "The servers were rebooted by accident and all applications were unavailable.", not "No, we're just experiencing higher errors than usual" - Pfft
-
@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:
@scottalanmiller Agree. Totally. When asked "Is the server down." you can say "The server was rebooting as planned, as the application was not part of a cluster, the application was offline". But to say "No, the server was fine." is misleading.
Sure, that's all that I am saying. Don't say that the server is fine without explanation, but don't agree that it is down when it's not what was asked or meant.
-
@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:
Amazon's servers were down, rebooting, restarting etc, when they were supposed to be up. "Is the server down" = "The servers were rebooted by accident and all applications were unavailable.", not "No, we're just experiencing higher errors than usual" - Pfft
Well as I said to someone privately, an error is a form of outage.
-
@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:
Well as I said to someone privately, an error is a form of outage.
'Error' - it's misleading. Wrong word to use. AWS lost a lot of respect in my eyes. It's an accident, a mistake, a Fu** Up. Not more errors than usual.
-
@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:
'Error' - it's misleading. Wrong word to use. AWS lost a lot of respect in my eyes. It's an accident, a mistake, a Fu** Up. Not more errors than usual.
They admitted to the mistake that caused the errors. The question is... did anyone get data, just at a high error rate? If so, then they should not lose any respect. My understanding is that their statement was totally accurate and correct. I am not okay with them just calling things "down" if they are not fully down and while I didn't test, what I heard was that it was not fully down. Reporting the truth is important because their customers getting some data might have reported the system fixed when it was still at the high error rate, for example.
-
The high error rate might have impacted services, too. For example, "down detection" might have seen the service as still online, even though it was not working properly. Reporting only that it was "down" would have been inaccurate. Possibly impactfully so. So I see the opposite, AWS seems to have acted perfectly. They took full blame and communicated correctly. What could they have improved, other than not having the incident at all?
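To illustrate why "high error rate" and "down" are different reports, here is a minimal sketch. The function names and the 5% threshold are assumptions made up for the example, not anything AWS or a real down-detection service uses:

```python
def classify(successes: int, failures: int, degraded_threshold: float = 0.05) -> str:
    """Classify a service from request counts over some sampling interval.

    degraded_threshold is a hypothetical cut-off for calling the
    service "degraded" rather than "up".
    """
    total = successes + failures
    if total == 0:
        return "unknown"  # no traffic observed, nothing to conclude
    if successes == 0:
        return "down"  # every request failed
    if failures / total >= degraded_threshold:
        return "degraded"  # some requests succeed, but many fail
    return "up"


def naive_is_up(successes: int, failures: int) -> bool:
    """A naive down detector: any successful request counts as 'online'."""
    return successes > 0
```

With 20 successes and 80 failures, `naive_is_up` still reports the service as online while `classify` reports it as degraded, which is the gap between "down" and "high error rates" described above.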
-
@scottalanmiller "We accidentally restarted core servers which as a result brought many services offline and impacted many customers." - that is what I consider accurate here. "We experienced higher than usual error rates" is a cop-out.
-
@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:
@scottalanmiller "We accidentally restarted core servers which as a result brought many services offline and impacted many customers." - that is what I consider accurate here. "We experienced higher than usual error rates" is a cop-out.
But they admitted the latter once they knew what had happened. While it was happening they only knew that they had high error rates, and high error rates are what people using S3 needed to know. It's not a cop out at all, not in any way. The truth cannot be a cop out, by definition.
A cop out would be services that rely on S3 being down and blaming S3 instead of admitting that they opted out of redundancy when S3 delivered on their SLA correctly.
-
The problem is that if you are Amazon's customer, the high error rates are what you need to know, and you are the ONLY people to whom Amazon is providing info. If you are a customer of Amazon's customers, then it is the job of THOSE companies to say that they are down. If they say that they are down because S3 had high error rates that they could not handle, and that they could not fail over to another site that was not down, that would be fine; to just say that they were down would be fine too. But Amazon was 100% spot on in their reporting, IMHO. Zero cop out.
-
There is no doubt that Amazon had a bad day yesterday. But let's keep the facts straight which are, AFAIK:
- Amazon reported the outage accurately.
- Amazon resolved the outage promptly and well.
- Amazon maintained other datacenters so customers following Amazon's advice were not impacted.
- Amazon maintained their SLA.
- Amazon admitted all fault.
-
@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:
Right. So why would you think that I would think that if I put data in just one DC I would have DC failover? That doesn't make any sense.
Using AWS doesn't automatically mean you have geophysical diversity - do you think it does? What makes you think that?
-
@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:
@Breffni-Potter said:
@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:
I disagree with the article. One of the main reasons I would move to a cloud service is to outsource my redundancy and resilience.
But you don't buy any of that from Amazon. This is the biggest misconception about cloud computing.
Clearly I have a misconception. I'm not an Amazon customer, but looking at their website, they say things like:
Designed for 99.999999999% durability and 99.99% availability of objects over a given year.
Designed to sustain the concurrent loss of data in two facilities.
Amazon S3 redundantly stores data in multiple facilities and on multiple devices within each facility.
All of this seems to me that they are selling resilience. If I read "designed for 99.99%" and then only got 90% availability, would it be fair for Amazon to say "yeah, but that's your fault, we never sold you resilience?" I don't think so.
If the argument we're having is "you're not paying for 100% availability" then I agree with you. If your argument is "you're not paying for resilience" then I struggle to agree with you.
I see there are about a dozen replies already - but I'm replying out of time anyhow.
Sure, their system is "designed" for that, but you still have to buy that level. You have the option to buy many different levels of nines based on your need.
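For a sense of scale, the availability percentages quoted above convert to allowed downtime with simple arithmetic. This is just a sketch of that conversion. The helper name is made up, and it says nothing about what any particular SLA actually guarantees:

```python
def allowed_downtime_minutes(availability_percent: float, days: int = 365) -> float:
    """Minutes of downtime permitted over `days` at the given availability."""
    minutes_in_period = days * 24 * 60
    return (1 - availability_percent / 100) * minutes_in_period


# Compare the "designed for 99.99%" figure with the 90% example from the quote.
for nines in (99.99, 99.9, 99.0, 90.0):
    print(f"{nines}% -> {allowed_downtime_minutes(nines):.2f} minutes per year")
```

Over a 365-day year, 99.99% works out to a little under an hour of downtime, while 90% is about 36.5 days.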
-
@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:
Part of work is extrapolating what people mean, even if they asked the question in the wrong way. "Is the car working" covers fuel, seats, keys, doors etc, the same as "Is the server down" covers it being in a reboot state... yes, what that server does is down.
Sure - but at the same time it's your job to help break those people from lumping all those things together.
"Is the car working?" Answer: yes, the CAR is working, but the stereo is not. If they ask you why you said yes to "is the car working", you explain that one has nothing to do with the other, and that it's important for them to understand that so they can ensure the correct people are looking at the actual problem.
-
@Dashrender said in Amazon S3 Outage shows the danger of doing things cheaply.:
Sure - but at the same time it's your job to help break those people from lumping all those things together.
"Is the car working?" Answer: yes, the CAR is working, but the stereo is not. If they ask you why you said yes to "is the car working", you explain that one has nothing to do with the other, and that it's important for them to understand that so they can ensure the correct people are looking at the actual problem.
That was my point. Don't just let them be wrong.
-
@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:
That was my point. Don't just let them be wrong.
Right, I was driving your point home.