Amazon S3 Outage shows the danger of doing things cheaply.
-
The big deal about S3 is the durability. They simply never lose data, ever. You might lose access for a few hours, but the data will always be there.
-
@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:
Less than I expect from an average server.
It's about an hour a year, I think? We probably get roughly that from our servers because of scheduled reboots and upgrades etc etc. In terms of unplanned downtime, not sure.
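To put "about an hour a year" in perspective, here is a minimal sketch of the arithmetic. The one-hour figure is only the estimate from the post above, not a number taken from any SLA, and the function names are made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability(downtime_hours: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the fraction of the period during which the service was up."""
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime must be between 0 and the period length")
    return 1 - downtime_hours / period_hours


def allowed_downtime_hours(target: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Return how many hours of downtime a given availability target permits."""
    if not 0 <= target <= 1:
        raise ValueError("target must be a fraction between 0 and 1")
    return (1 - target) * period_hours


if __name__ == "__main__":
    # One hour of downtime in a year, expressed as a percentage.
    print(f"{availability(1.0):.4%}")
    # Downtime budget for a hypothetical 99.9% target.
    print(f"{allowed_downtime_hours(0.999):.2f} hours")
```

Whether scheduled reboots count against that figure depends on how downtime is defined, which is what the rest of the thread argues about.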
-
@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:
It's about an hour a year, I think? We probably get roughly that from our servers because of scheduled reboots and upgrades etc etc. In terms of unplanned downtime, not sure.
Yeah, generally we don't count planned downtime. Partly that's because we typically discuss the server level in house, not the software level, as often we have no control over the latter. And partly it's because it's very different: it's downtime when downtime is approved. It is "down" but not down how people normally mean it.
-
And if a server only reboots, is that really down? The services on top go down during the reboot, but the server itself never fails or goes down, it just switches from a full running to a restarting state. But it is always running correctly.
-
@scottalanmiller When a business person says "Was the server down?" - they don't mean the physical box. They mean whatever that box provides to them, or the customers, or whatever that box does... The fact that it's not off, it's rebooting, doesn't matter - 'what it does' was down... for four hours.
-
@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:
@scottalanmiller When a business person says "Was the server down?" - they don't mean the physical box. They mean whatever that box provides to them, or the customers, or whatever that box does... The fact that it's not off, it's rebooting, doesn't matter - 'what it does' was down... for four hours.
Yes and no. Was the server down? No, it was in a greenzone. I speak to business people only about this stuff, and "offline in a down time" is never considered "down". And an application down to the system team is "not down." Is the server down? "Nope, go look into your app."
I realize that in the SMB people tend to look at one person for all of the questions from networking to systems to apps, but in the enterprise, if someone asks if the server is down the question is about a server, not an application. If a business person wants to know about X, they ask about X - competent business people don't start digging under the hood and asking the wrong questions to "sound cool." If the car doesn't work, a competent business person asks "is the car working" to their mechanic, they do not say "check if there is enough oil" when they actually want to know if the car works - clearly not a relevant question.
-
If your business people start trying to get under the hood, whatever you do don't condescend and talk to them like they are children. Give them the facts, don't mislead them. If they ask if the server is down be clear - no, the server did not experience a failure, but there might be an issue with the application or some other part of the system, what issue are you asking about?
Playing the "management are children" game is dangerous because that's how we get dangerous situations like a manager asking "is the server down", getting the incorrect answer of "yes", then demanding that Dell be brought in to answer for something that has nothing to do with them and then randomly switching vendors on the next purchase because they were told that the server failed, when it did not. It makes them look foolish to staff, to vendors, undermines their vendor relationships and makes them impotent at business decisions because they think that they have info that they do not.
You can't control if management wants to take over IT, but don't try to abstract IT in a condescending way to enable this, it can't provide a good outcome.
-
@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:
Yes and no. Was the server down? No, it was in a greenzone. I speak to business people only about this stuff, and "offline in a down time" is never considered "down". And an application down to the system team is "not down." Is the server down? "Nope, go look into your app."
I realize that in the SMB people tend to look at one person for all of the questions from networking to systems to apps, but in the enterprise, if someone asks if the server is down the question is about a server, not an application. If a business person wants to know about X, they ask about X - competent business people don't start digging under the hood and asking the wrong questions to "sound cool." If the car doesn't work, a competent business person asks "is the car working" to their mechanic, they do not say "check if there is enough oil" when they actually want to know if the car works - clearly not a relevant question.
Part of work is extrapolating what people mean, even if they asked the question in the wrong way. "Is the car working" covers fuel, seats, keys, doors etc, the same as "Is the server down" covers it being in a reboot state... yes, what that server does is down.
-
@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:
If your business people start trying to get under the hood, whatever you do don't condescend and talk to them like they are children. Give them the facts, don't mislead them. If they ask if the server is down be clear - no, the server did not experience a failure, but there might be an issue with the application or some other part of the system, what issue are you asking about?
Playing the "management are children" game is dangerous because that's how we get dangerous situations like a manager asking "is the server down", getting the incorrect answer of "yes", then demanding that Dell be brought in to answer for something that has nothing to do with them and then randomly switching vendors on the next purchase because they were told that the server failed, when it did not. It makes them look foolish to staff, to vendors, undermines their vendor relationships and makes them impotent at business decisions because they think that they have info that they do not.
You can't control if management wants to take over IT, but don't try to abstract IT in a condescending way to enable this, it can't provide a good outcome.
If the server is being restarted, and the application has not been temporarily moved to another server, that server/service/application/whatever is down. If a tech said to me "Nah, the server is on, has power, and is just restarting so it's actually up, we just didn't bother to move your application to another node in the cluster first - but it's up." - they are wrong. It's down.
-
@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:
Part of work is extrapolating what people mean, even if they asked the question in the wrong way. "Is the car working" covers fuel, seats, keys, doors etc, the same as "Is the server down" covers it being in a reboot state... yes, what that server does is down.
But you have to keep them from believing that they were told that "the server was down" or bad information gets repeated. Of course you have to extrapolate, but you can always clarify. If you don't require people to speak accurately, you validate misinformation and you become part of the problem if you are not careful. If management wants under the hood information, that information needs to be correct.
-
@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:
If the server is being restarted, and the application has not been temporarily moved to another server, that server/service/application/whatever is down. If a tech said to me "Nah, the server is on, has power, and is just restarting so it's actually up, we just didn't bother to move your application to another node in the cluster first - but it's up." - they are wrong. It's down.
Not if it is during a greenzone, that would be misinformation. Is it down? Yes, but it is supposed to be. Do you say that your car is "missing" when the police ask you, or do you admit that you sold it?
Like you said, you have to extrapolate and what they mean by "down" is that it is "down when it is supposed to be up", which is not true. It's down when it is supposed to be down.
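The distinction between "down when it is supposed to be up" and "down when it is supposed to be down" can be made mechanical. A rough sketch, assuming greenzones are recorded as simple start/end pairs that do not overlap each other (the function name and the sample times are hypothetical):

```python
from datetime import datetime

Interval = tuple[datetime, datetime]


def unplanned_seconds(outage: Interval, greenzones: list[Interval]) -> float:
    """Return the part of an outage, in seconds, that fell outside every greenzone.

    Assumes the greenzones do not overlap each other.
    """
    start, end = outage
    if end < start:
        raise ValueError("outage ends before it starts")
    total = (end - start).total_seconds()
    planned = 0.0
    for zone_start, zone_end in greenzones:
        # Overlap between the outage and this approved window, if any.
        overlap_start = max(start, zone_start)
        overlap_end = min(end, zone_end)
        if overlap_end > overlap_start:
            planned += (overlap_end - overlap_start).total_seconds()
    return total - planned


if __name__ == "__main__":
    greenzone = (datetime(2017, 3, 1, 2, 0), datetime(2017, 3, 1, 4, 0))
    # A reboot entirely inside the approved window: nothing unplanned.
    reboot = (datetime(2017, 3, 1, 2, 30), datetime(2017, 3, 1, 2, 45))
    print(unplanned_seconds(reboot, [greenzone]))
    # An outage that runs past the end of the window: the overrun is unplanned.
    overrun = (datetime(2017, 3, 1, 3, 30), datetime(2017, 3, 1, 4, 30))
    print(unplanned_seconds(overrun, [greenzone]))
```

Reporting both numbers, total and unplanned, keeps the answer accurate without pretending the application was reachable.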
-
@scottalanmiller Agree. Totally. When asked "Is the server down?" you can say "The server was rebooting as planned; as the application was not part of a cluster, the application was offline". But to say "No, the server was fine." is misleading.
-
@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:
Not if it is during a greenzone, that would be misinformation. Is it down? Yes, but it is supposed to be. Do you say that your car is "missing" when the police ask you, or do you admit that you sold it?
Like you said, you have to extrapolate and what they mean by "down" is that it is "down when it is supposed to be up", which is not true. It's down when it is supposed to be down.
Amazon's servers were down, rebooting, restarting etc., when they were supposed to be up. "Is the server down" = "The servers were rebooted by accident and all applications were unavailable.", not "No, we're just experiencing higher errors than usual" - Pfft
-
@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:
@scottalanmiller Agree. Totally. When asked "Is the server down?" you can say "The server was rebooting as planned; as the application was not part of a cluster, the application was offline". But to say "No, the server was fine." is misleading.
Sure, that's all that I am saying. Don't say that the server is fine without explanation, but don't agree that it is down when it's not what was asked or meant.
-
@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:
Amazon's servers were down, rebooting, restarting etc., when they were supposed to be up. "Is the server down" = "The servers were rebooted by accident and all applications were unavailable.", not "No, we're just experiencing higher errors than usual" - Pfft
Well as I said to someone privately, an error is a form of outage.
-
@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:
Well as I said to someone privately, an error is a form of outage.
'Error' is misleading. Wrong word to use. AWS lost a lot of respect in my eyes. It's an accident, a mistake, a Fu** Up. Not more errors than usual.
-
@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:
'Error' is misleading. Wrong word to use. AWS lost a lot of respect in my eyes. It's an accident, a mistake, a Fu** Up. Not more errors than usual.
They admitted to the mistake that caused the errors. The question is... did anyone get data, just at a high error rate? If so, then they should not lose any respect. My understanding is that their statement was totally accurate and correct. I am not okay with them just calling things "down" if they are not fully down and while I didn't test, what I heard was that it was not fully down. Reporting the truth is important because their customers getting some data might have reported the system fixed when it was still at the high error rate, for example.
-
The high error rate might have impacted services, too. For example, "down detection" might have seen the service as still online, even though it was not working properly. Reporting only that it was "down" would have been inaccurate. Possibly impactfully so. So I see the opposite, AWS seems to have acted perfectly. They took full blame and communicated correctly. What could they have improved, other than not having the incident at all?
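As an illustration of why a binary up/down check can miss this, here is a rough sketch of a three-state classification based on observed error rate. The 5% threshold, the status names and the function names are all assumptions for the example, not anything AWS or a particular monitoring tool uses.

```python
from enum import Enum


class Status(Enum):
    OK = "ok"
    DEGRADED = "degraded"
    DOWN = "down"


def naive_is_up(successes: int, failures: int) -> bool:
    """Binary check: anything that answers at least once counts as up."""
    return successes > 0


def classify(successes: int, failures: int, degraded_at: float = 0.05) -> Status:
    """Classify a sample of requests by error rate rather than up/down.

    degraded_at is a hypothetical threshold: at or above it the service is
    reported as degraded; only a 100% error rate is reported as down.
    """
    total = successes + failures
    if total == 0:
        raise ValueError("no requests observed")
    error_rate = failures / total
    if error_rate >= 1.0:
        return Status.DOWN
    if error_rate >= degraded_at:
        return Status.DEGRADED
    return Status.OK


if __name__ == "__main__":
    # 60% of requests failing: the binary check still says "up".
    print(naive_is_up(40, 60), classify(40, 60))
```

With a sample where most requests fail but some succeed, the binary check reports the service as up while the error-rate view reports it as degraded, which is the gap between "down" and "high error rates" being argued here.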
-
@scottalanmiller "We accidentally restarted core servers, which brought many services offline and impacted many customers." - that is what I consider accurate here. "We experienced higher than usual error rates" is a cop-out.
-
@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:
@scottalanmiller "We accidentally restarted core servers, which brought many services offline and impacted many customers." - that is what I consider accurate here. "We experienced higher than usual error rates" is a cop-out.
But they admitted the former once they knew what had happened. While it was happening they only knew that they had high error rates, and high error rates are what people using S3 needed to know about. It's not a cop-out at all, not in any way. The truth cannot be a cop-out, by definition.
A cop-out would be services that rely on S3 being down and blaming S3 instead of admitting that they opted out of redundancy when S3 delivered on its SLA correctly.