Amazon S3 Outage shows the danger of doing things cheaply.

scottalanmiller

@travisdh1 said in Amazon S3 Outage shows the danger of doing things cheaply.:

Finally, a follow-up article. Apparently they just rebooted too many machines at the same time. Human error, is anyone surprised?

Always something simple.

Dashrender

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

@Dashrender said in Amazon S3 Outage shows the danger of doing things cheaply.:

Are you saying that you assume that simply by putting a VM (or actual cloud service) in AWS that you automatically assume you have full DC failover, etc? Why do you assume this?

I don't know what you mean by "full DC failover"? I would assume I'd have uptime within the SLA or within published expectations of uptime, which in Amazon's case is about 100% I believe?

DC as in Datacenter failover - i.e. this DC is offline for whatever reason, so now your data/services are running in another DC?
Even if it is listed at 100% the SLA just gives them an out when they don't meet it, i.e. they get to send you a check, that's it. Nothing more. I wouldn't expect them to realtime clone your data to another DC unless you're paying for that feature, at which point your SLA would be even higher, and you're paying a TON more.

Carnival Boy

Right. So why would you think that I would think that if I put data in just one DC I would have DC failover? That doesn't make any sense.

Jimmy9008

I don't blame AWS at all. If you use a service like AWS, use it properly and build for DR or take the risk! Think about the systems you use, and build properly.

scottalanmiller

@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:

I don't blame AWS at all. If you use a service like AWS, use it properly and build for DR or take the risk! Think about the systems you use, and build properly.

Oh exactly. It is what it is. Single DC dependency, and on a single service in that DC. AWS tells us what to do if we need higher reliability than that. They were within SLA, I believe. It's all good.

Jimmy9008

@scottalanmiller

Exactly. We host everything at HQ on site, with a colo in Essex for DR purposes. If HQ was lost, staff are screwed until we restore from backups BUT... customers (the important part) are not really affected at all. We keep a hot copy of our websites and databases running a day out of date (which the business are fine with) in the Essex colo. We then use 'the cloud' to manage the failover process, which is a cheap solution compared to multiple cloud DCs hosting everything.

We have one VM in Azure, and one in AWS. Both check that our websites hosted at HQ are available on HTTP/HTTPS every second or so. If the sites are not responding, the VMs use the Cloudflare API to point DNS for all our websites to the hot running copies in the colo, which are a day out of date. Pretty fast. When tested, it takes seconds and we're back online from a customer perspective. Our test: unplug our gateway firewall and see what happens... easy.
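For anyone wanting to picture the monitor described above, here is a minimal Python sketch, not the actual setup. It assumes the sites are repointed by overwriting an A record through Cloudflare's v4 "update DNS record" endpoint with a bearer token. The consecutive-failure threshold, the 60-second TTL, and every ID, hostname, and IP are hypothetical placeholders.

```python
import http.client
import json
import urllib.request
from typing import Iterable

# Assumption: Cloudflare v4 REST API base URL; all IDs and tokens are placeholders.
CLOUDFLARE_API = "https://api.cloudflare.com/client/v4"


def site_is_up(url: str, timeout: float = 2.0) -> bool:
    """Return True if the URL answers without a connection or HTTP error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status < 400
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError (4xx/5xx) and timeouts all derive from OSError.
        return False


def should_fail_over(recent_checks: Iterable[bool], max_failures: int = 3) -> bool:
    """Fail over only when the most recent `max_failures` checks all failed."""
    consecutive = 0
    for ok in recent_checks:
        consecutive = 0 if ok else consecutive + 1
    return consecutive >= max_failures


def build_dns_update(zone_id: str, record_id: str, name: str,
                     target_ip: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a request that repoints an A record."""
    body = json.dumps(
        {"type": "A", "name": name, "content": target_ip, "ttl": 60}
    ).encode()
    return urllib.request.Request(
        f"{CLOUDFLARE_API}/zones/{zone_id}/dns_records/{record_id}",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )
```

In use, each monitoring VM would call `site_is_up` in a loop, keep the last few results, and pass the request from `build_dns_update` to `urllib.request.urlopen` once `should_fail_over` returns True. Requiring several consecutive failures is a design choice added here so that a single dropped request does not flip DNS.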

Yeah, it can be better, but it meets our needs and other than Cloudflare (which does go down) we have no single point of failure... We're happy with that risk.

Carnival Boy

@Breffni-Potter said:

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

I disagree with the article. One of the main reasons I would move to a cloud service is to outsource my redundancy and resilience.

But you don't buy any of that from Amazon. This is the biggest misconception about cloud computing.

Clearly I have a misconception. I'm not an Amazon customer, but looking at their website, they say things like:
Designed for 99.999999999% durability and 99.99% availability of objects over a given year.
Designed to sustain the concurrent loss of data in two facilities.
Amazon S3 redundantly stores data in multiple facilities and on multiple devices within each facility.

All of this suggests to me that they are selling resilience. If I read "designed for 99.99%" and then only got 90% availability, would it be fair for Amazon to say "yeah, but that's your fault, we never sold you resilience?" I don't think so.

If the argument we're having is "you're not paying for 100% availability" then I agree with you. If your argument is "you're not paying for resilience" then I struggle to agree with you.

Jimmy9008

@Carnival-Boy Agree. Nice.

scottalanmiller

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

@Breffni-Potter said:

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

I disagree with the article. One of the main reasons I would move to a cloud service is to outsource my redundancy and resilience.

But you don't buy any of that from Amazon. This is the biggest misconception about cloud computing.

Clearly I have a misconception. I'm not an Amazon customer, but looking at their website, they say things like:
Designed for 99.999999999% durability and 99.99% availability of objects over a given year.
Designed to sustain the concurrent loss of data in two facilities.
Amazon S3 redundantly stores data in multiple facilities and on multiple devices within each facility.

All of this suggests to me that they are selling resilience. If I read "designed for 99.99%" and then only got 90% availability, would it be fair for Amazon to say "yeah, but that's your fault, we never sold you resilience?" I don't think so.

99.99% is very low availability. 99.999% is "standard" availability. High availability is 99.9999%. They are selling 99.99% uptime, that can't be considered "selling reliability" as it is far too unreliable for that. It's fine for most customers, most customers don't need much availability.

So I read the same thing as saying "designed for 99.99% availability" which is a direct statement making it super clear that Amazon S3, unless you do things yourself to make it high availability, is not at all designed for "availability" as a target feature. To me, they've clarified that in what you quoted to make sure we don't assume that availability is their specialty.

And they meet 99.99% with ease. 90% would mean that they were down for more than a month a year, not an afternoon.
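To put numbers on the availability figures in this exchange, here is a small Python sketch that converts an availability percentage into allowed downtime per year. It assumes a 365-day year and counts all downtime the same way, planned or not. The function name is made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Minutes of downtime per year implied by an availability percentage."""
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_YEAR


# The levels mentioned in the thread: 90%, four nines, five nines, six nines.
for pct in (90.0, 99.99, 99.999, 99.9999):
    minutes = downtime_minutes_per_year(pct)
    print(f"{pct}% -> {minutes:,.1f} min/year ({minutes / 1440:.2f} days)")
```

By this arithmetic, 90% works out to about 36.5 days of downtime a year, 99.99% to roughly 52.6 minutes, 99.999% to a little over 5 minutes, and 99.9999% to about 32 seconds.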

scottalanmiller

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

If your argument is "you're not paying for resilience" then I struggle to agree with you.

You are paying for a very specific level of resilience which is considered "low". So you "are paying for resilience", but not high resilience.

Carnival Boy

Yeah, I'd agree with that. You're right that 99.99% is low.

scottalanmiller

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

Yeah, I'd agree with that. You're right that 99.99% is low.

Or low-ish at least. It's four nines. It's more than I expect from an average SAN. Less than I expect from an average server.

scottalanmiller

The big deal about S3 is the durability. They simply never lose data, ever. You might lose access for a few hours, but the data will always be there.
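As a rough illustration of what the quoted "designed for 99.999999999% durability" figure implies, the sketch below treats it as an average annual per-object loss rate. That reading is an assumption, and the figure is a design target rather than a guarantee. The function names and the object count are hypothetical.

```python
from fractions import Fraction

# Assumption: the 11-nines design figure is an average per-object, per-year rate.
DURABILITY = Fraction("0.99999999999")
ANNUAL_LOSS_RATE = 1 - DURABILITY  # expected fraction of objects lost per year


def expected_objects_lost_per_year(object_count: int) -> Fraction:
    """Average number of objects lost per year for a given object count."""
    return object_count * ANNUAL_LOSS_RATE


def years_per_single_loss(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    return 1 / expected_objects_lost_per_year(object_count)


objects = 10_000_000  # hypothetical bucket size
print(f"{objects:,} objects: about one lost every "
      f"{int(years_per_single_loss(objects)):,} years on average")
```

Exact fractions are used instead of floats because subtracting 0.99999999999 from 1 in floating point loses precision at this scale.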

Carnival Boy

@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:

Less than I expect from an average server.

It's about an hour a year, I think? We probably get roughly that from our servers because of scheduled reboots and upgrades etc etc. In terms of unplanned downtime, not sure.

scottalanmiller

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:

Less than I expect from an average server.

It's about an hour a year, I think? We probably get roughly that from our servers because of scheduled reboots and upgrades etc etc. In terms of unplanned downtime, not sure.

Yeah, generally we don't count planned downtime - partially because we typically discuss the server level in house, not the software level, just... because often we have no control over the latter. And partially because it's very different, it's downtime when downtime is approved. It is "down" but not down how people normally mean it.

scottalanmiller

And if a server only reboots, is that really down? The services on top go down during the reboot, but the server itself never fails or goes down, it just switches from a fully running state to a restarting state. But it is always running correctly.

Jimmy9008

@scottalanmiller When a business person says "Was the server down" - they don't mean the physical box. They mean whatever that box provides to them, or the customers, or whatever that box does... The fact that it's not off, it's rebooting, doesn't matter - 'what it does' was down... for four hours.

scottalanmiller

@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:

@scottalanmiller When a business person says "Was the server down" - they don't mean the physical box. They mean whatever that box provides to them, or the customers, or whatever that box does... The fact that it's not off, it's rebooting, doesn't matter - 'what it does' was down... for four hours.

Yes and no. Was the server down? No, it was in a greenzone. I speak to business people only about this stuff, and "offline in a down time" is never considered "down". And an application down to the system team is "not down." Is the server down? "Nope, go look into your app."

I realize that in the SMB people tend to look at one person for all of the questions from networking to systems to apps, but in the enterprise, if someone asks if the server is down the question is about a server, not an application. If a business person wants to know about X, they ask about X - competent business people don't start digging under the hood and asking the wrong questions to "sound cool." If the car doesn't work, a competent business person asks "is the car working" to their mechanic, they do not say "check if there is enough oil" when they actually want to know if the car works - clearly not a relevant question.

scottalanmiller

If your business people start trying to get under the hood, whatever you do don't condescend and talk to them like they are children. Give them the facts, don't mislead them. If they ask if the server is down be clear - no, the server did not experience a failure, but there might be an issue with the application or some other part of the system, what issue are you asking about?

Playing the "management are children" game is dangerous because that's how we get dangerous situations like a manager asking "is the server down", getting the incorrect answer of "yes", then demanding that Dell be brought in to answer for something that has nothing to do with them and then randomly switching vendors on the next purchase because they were told that the server failed, when it did not. It makes them look foolish to staff, to vendors, undermines their vendor relationships and makes them impotent at business decisions because they think that they have info that they do not.

You can't control if management wants to take over IT, but don't try to abstract IT in a condescending way to enable this, it can't provide a good outcome.

Jimmy9008

@scottalanmiller

@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:

@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:

@scottalanmiller When a business person says "Was the server down" - they don't mean the physical box. They mean whatever that box provides to them, or the customers, or whatever that box does... The fact that it's not off, it's rebooting, doesn't matter - 'what it does' was down... for four hours.

Yes and no. Was the server down? No, it was in a greenzone. I speak to business people only about this stuff, and "offline in a down time" is never considered "down". And an application down to the system team is "not down." Is the server down? "Nope, go look into your app."

I realize that in the SMB people tend to look at one person for all of the questions from networking to systems to apps, but in the enterprise, if someone asks if the server is down the question is about a server, not an application. If a business person wants to know about X, they ask about X - competent business people don't start digging under the hood and asking the wrong questions to "sound cool." If the car doesn't work, a competent business person asks "is the car working" to their mechanic, they do not say "check if there is enough oil" when they actually want to know if the car works - clearly not a relevant question.

Part of work is extrapolating what people mean, even if they asked the question in the wrong way. "Is the car working" covers fuel, seats, keys, doors etc, the same as "Is the server down" covers it being in a reboot state... yes, what that server does is down.