Amazon S3 Outage shows the danger of doing things cheaply.

scottalanmiller

@coliver said in Amazon S3 Outage shows the danger of doing things cheaply.:

@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:

@coliver said in Amazon S3 Outage shows the danger of doing things cheaply.:

I would like to see the cost comparison. How much money did these companies lose due to this outage versus how much they would spend on a resilient infrastructure that is geographically load balanced? Is a few hours of downtime a year due to S3 going to break a lot of these companies?

That's missing a HUGE element, which is the risk. Just because you had a bad outage doesn't mean that it was worth protecting against. You have to consider how likely it was to happen, regardless of whether or not it did happen.

I get that. I would like to see the numbers related to the most recent outage though, for purely academic reasons. I just think it would be interesting. Ignoring the risk entirely for a moment, my guess is that having the infrastructure for a year to protect against this unlikely event would still have cost more than the downtime itself cost.

Very easily for sure.
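
To make the risk point above concrete, here is a minimal sketch of the comparison being discussed: expected yearly loss from an outage versus the yearly cost of protecting against it. Every figure is hypothetical and made up for illustration (nobody in this thread posted real numbers), and it assumes the protection would fully prevent the outage, which is optimistic.

```python
def expected_annual_loss(outage_probability_per_year, hours_down, loss_per_hour):
    """Expected yearly cost of an outage: probability x duration x hourly loss."""
    return outage_probability_per_year * hours_down * loss_per_hour


def protection_pays_off(annual_protection_cost, outage_probability_per_year,
                        hours_down, loss_per_hour):
    """True when the expected loss avoided exceeds what the protection costs.

    Assumes (optimistically) that the protection prevents the outage entirely.
    """
    avoided = expected_annual_loss(
        outage_probability_per_year, hours_down, loss_per_hour
    )
    return avoided > annual_protection_cost


# Hypothetical figures only: a 4-hour outage expected once every 4 years,
# costing 10,000 per hour, versus 60,000 a year for a second region.
print(expected_annual_loss(0.25, 4, 10_000))
print(protection_pays_off(60_000, 0.25, 4, 10_000))
```

With numbers like these the protection costs far more than the expected loss, which is the guess made above. Different inputs can easily flip the answer.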

Carnival Boy

@Dashrender said in Amazon S3 Outage shows the danger of doing things cheaply.:

Are you saying that you assume that simply by putting a VM (or actual cloud service) in AWS that you automatically assume you have full DC failover, etc? Why do you assume this?

I don't know what you mean by "full DC failover"? I would assume I'd have uptime within the SLA or within published expectations of uptime, which in Amazon's case is about 100% I believe?

scottalanmiller

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

@Dashrender said in Amazon S3 Outage shows the danger of doing things cheaply.:

Are you saying that you assume that simply by putting a VM (or actual cloud service) in AWS that you automatically assume you have full DC failover, etc? Why do you assume this?

I don't know what you mean by "full DC failover"? I would assume I'd have uptime within the SLA or within published expectations of uptime, which in Amazon's case is about 100% I believe?

A bit below 100%. Their uptime is from using multiple data centers.
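
As a rough illustration of why multiple data centers raise uptime, the usual back-of-envelope formula for N redundant sites is 1 - (1 - a)^N. This is only a sketch and assumes the sites fail independently, which real outages (human error, shared control planes) often violate, so treat the result as an upper bound. The 99% input is a made-up example, not an Amazon figure.

```python
def combined_availability(single_site_availability, sites):
    """Availability of `sites` redundant sites, assuming independent failures.

    The service is down only when every site is down at the same time.
    """
    return 1 - (1 - single_site_availability) ** sites


# Hypothetical example: one site at 99% versus two and three such sites.
for n in (1, 2, 3):
    print(n, combined_availability(0.99, n))
```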

travisdh1

Finally a follow-up article. Apparently they just rebooted too many machines at the same time. Human error, is anyone surprised?

scottalanmiller

@travisdh1 said in Amazon S3 Outage shows the danger of doing things cheaply.:

Finally a follow-up article. Apparently they just rebooted too many machines at the same time. Human error, is anyone surprised?

Always something simple.

Dashrender

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

@Dashrender said in Amazon S3 Outage shows the danger of doing things cheaply.:

Are you saying that you assume that simply by putting a VM (or actual cloud service) in AWS that you automatically assume you have full DC failover, etc? Why do you assume this?

I don't know what you mean by "full DC failover"? I would assume I'd have uptime within the SLA or within published expectations of uptime, which in Amazon's case is about 100% I believe?

DC as in datacenter failover, i.e. this DC is offline for whatever reason, so now your data/services are running in another DC.
Even if it is listed at 100%, the SLA just gives them an out when they don't meet it, i.e. they get to send you a check, and that's it. I wouldn't expect them to clone your data to another DC in real time unless you're paying for that feature, at which point your SLA would be even higher, and you'd be paying a TON more.

Carnival Boy

Right. So why would you think that I would think that if I put data in just one DC I would have DC failover? That doesn't make any sense.

Jimmy9008

I don't blame AWS at all. If you use a service like AWS, use it properly and build for DR or take the risk! Think about the systems you use, and build properly.

scottalanmiller

@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:

I don't blame AWS at all. If you use a service like AWS, use it properly and build for DR or take the risk! Think about the systems you use, and build properly.

Oh exactly. It is what it is. Single DC dependency, and on a single service in that DC. AWS tells us what to do if we need higher reliability than that. They were within SLA, I believe. It's all good.

Jimmy9008

@scottalanmiller

Exactly. We host everything at HQ on site, with a colo in Essex for DR purposes. If HQ was lost, staff are screwed until we restore from backups BUT... customers (the important part) are not really affected at all. We keep a hot copy of our websites and databases running a day out of date (which the business are fine with) in the Essex colo. We then use 'the cloud' to manage the failover process, which is a cheap solution compared to multiple cloud DCs hosting everything.

We have one VM in Azure and one in AWS. Both check that our websites hosted at HQ are available over HTTP/HTTPS every second or so. If the sites are not responding, they use the Cloudflare API to point DNS for all our websites at the hot running copies in the colo that are a day out of date. Pretty fast. When tested, it takes seconds and we're back online from a customer perspective. Our test: unplug our gateway firewall and see what happens... easy.

Yeah, it can be better, but it meets our needs and other than Cloudflare (which does go down) we have no single point of failure... We're happy with that risk.
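
For anyone wanting to picture the watchdog described above, here is a minimal sketch of the idea, not the actual scripts in use. The `switch_dns` callback is a hypothetical placeholder for wherever the Cloudflare API call would go (no real Cloudflare call is shown), and the "three failures in a row" threshold is an assumption added here to avoid failing over on a single dropped request.

```python
import urllib.error
import urllib.request


def site_is_up(url, timeout=2.0):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, timeouts, refused connections and bad URLs.
        return False


class FailoverMonitor:
    """Calls switch_dns() once after several consecutive failed checks."""

    def __init__(self, check, switch_dns, failures_needed=3):
        self.check = check                # callable returning True when healthy
        self.switch_dns = switch_dns      # hypothetical hook for the DNS update
        self.failures_needed = failures_needed
        self.consecutive_failures = 0
        self.failed_over = False

    def poll(self):
        """Run one health check; fail over after enough consecutive failures."""
        if self.failed_over:
            return
        if self.check():
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failures_needed:
            self.switch_dns()
            self.failed_over = True


# Usage sketch (hypothetical URL and hook, left commented out on purpose):
# monitor = FailoverMonitor(lambda: site_is_up("https://www.example.com/"),
#                           point_dns_at_colo)
# while True:
#     monitor.poll()
#     time.sleep(1)
```

Running the same monitor from two different providers, as described above, means the watchdog itself is not a single point of failure. Failing back to HQ is left out and would presumably be a manual step.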

Carnival Boy

@Breffni-Potter said:

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

I disagree with the article. One of the main reasons I would move to a cloud service is to outsource my redundancy and resilience.

But you don't buy any of that from Amazon. This is the biggest misconception about cloud computing.

Clearly I have a misconception. I'm not an Amazon customer, but looking at their website, they say things like:
Designed for 99.999999999% durability and 99.99% availability of objects over a given year.
Designed to sustain the concurrent loss of data in two facilities.
Amazon S3 redundantly stores data in multiple facilities and on multiple devices within each facility.

All of this seems to me that they are selling resilience. If I read "designed for 99.99%" and then only got 90% availability, would it be fair for Amazon to say "yeah, but that's your fault, we never sold you resilience?" I don't think so.

If the argument we're having is "you're not paying for 100% availability" then I agree with you. If your argument is "you're not paying for resilience" then I struggle to agree with you.

Jimmy9008

@Carnival-Boy Agree. Nice.

scottalanmiller

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

@Breffni-Potter said:

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

I disagree with the article. One of the main reasons I would move to a cloud service is to outsource my redundancy and resilience.

But you don't buy any of that from Amazon. This is the biggest misconception about cloud computing.

Clearly I have a misconception. I'm not an Amazon customer, but looking at their website, they say things like:
Designed for 99.999999999% durability and 99.99% availability of objects over a given year.
Designed to sustain the concurrent loss of data in two facilities.
Amazon S3 redundantly stores data in multiple facilities and on multiple devices within each facility.

All of this seems to me that they are selling resilience. If I read "designed for 99.99%" and then only got 90% availability, would it be fair for Amazon to say "yeah, but that's your fault, we never sold you resilience?" I don't think so.

99.99% is very low availability. 99.999% is "standard" availability. High availability is 99.9999%. They are selling 99.99% uptime, that can't be considered "selling reliability" as it is far too unreliable for that. It's fine for most customers, most customers don't need much availability.

So I read the same thing as saying "designed for 99.99% availability" which is a direct statement making it super clear that Amazon S3, unless you do things yourself to make it high availability, is not at all designed for "availability" as a target feature. To me, they've clarified that in what you quoted to make sure we don't assume that availability is their specialty.

And they meet 99.99% with ease. 90% would mean that they were down for over a month a year (about 36.5 days), not an afternoon.
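
The arithmetic behind these percentages is easy to check. This small sketch converts an availability percentage into allowed downtime per year, assuming a 365-day year and counting every hour equally.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_percent):
    """Hours of downtime per year allowed at the given availability."""
    return (100 - availability_percent) / 100 * HOURS_PER_YEAR


for pct in (90, 99.99, 99.999, 99.9999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% -> {hours:.4f} hours/year ({hours * 60:.2f} minutes)")
```

So 90% works out to 876 hours (36.5 days) a year, while four nines allows a little under an hour.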

scottalanmiller

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

If your argument is "you're not paying for resilience" then I struggle to agree with you.

You are paying for a very specific level of resilience which is considered "low". So you "are paying for resilience", but not high resilience.

Carnival Boy

Yeah, I'd agree with that. You're right that 99.99% is low.

scottalanmiller

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

Yeah, I'd agree with that. You're right that 99.99% is low.

Or low-ish at least. It's four nines. It's more than I expect from an average SAN. Less than I expect from an average server.

scottalanmiller

The big deal about S3 is the durability. They simply never lose data, ever. You might lose access for a few hours, but the data will always be there.
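
To put the eleven nines of durability quoted earlier in the thread into perspective, here is a quick sketch of what that design target implies. It treats 99.999999999% as an annual per-object survival rate, and the object count is just an example. Note this is a "designed for" figure, not a guarantee.

```python
from fractions import Fraction

# Eleven nines of durability leaves an annual per-object loss rate of 1e-11.
ANNUAL_LOSS_RATE = 1 - Fraction("0.99999999999")


def expected_objects_lost_per_year(object_count):
    """Expected number of objects lost per year at the design durability."""
    return object_count * ANNUAL_LOSS_RATE


lost = expected_objects_lost_per_year(10_000_000)
print(f"Expected losses per year with 10 million objects: {float(lost)}")
print(f"That is one object roughly every {1 / lost} years")
```

Exact fractions are used here because subtracting 0.99999999999 from 1 in floating point picks up rounding noise.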

Carnival Boy

@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:

Less than I expect from an average server.

It's about an hour a year, I think? We probably get roughly that from our servers because of scheduled reboots and upgrades etc etc. In terms of unplanned downtime, not sure.

scottalanmiller

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:

Less than I expect from an average server.

It's about an hour a year, I think? We probably get roughly that from our servers because of scheduled reboots and upgrades etc etc. In terms of unplanned downtime, not sure.

Yeah, generally we don't count planned downtime. Partially because we typically discuss the server level in house, not the software level, just because often we have no control over the latter. And partially because it's very different: it's downtime when downtime is approved. It is "down" but not down how people normally mean it.

scottalanmiller

And if a server only reboots, is that really down? The services on top go down during the reboot, but the server itself never fails or goes down, it just switches from a full running to a restarting state. But it is always running correctly.