Amazon S3 Outage shows the danger of doing things cheaply.

Deleted74295

https://medium.com/dara-it/amazon-s3-outage-shows-the-danger-of-doing-things-cheaply-ce335e5b7edf#.bormdw6jd

The TL;DR: Amazon, like any technology tool, can be extremely reliable, or you can use it cheaply and accept an element of risk. How much time and money you spend on the tool will make all the difference.

coliver

I would like to see the cost comparison. How much money did these companies lose due to this outage versus how much they would spend on a resilient infrastructure that is geographically load balanced? Is a few hours of downtime a year due to S3 going to break a lot of these companies?

Dashrender

@coliver said in Amazon S3 Outage shows the danger of doing things cheaply.:

I would like to see the cost comparison. How much money did these companies lose due to this outage versus how much they would spend on a resilient infrastructure that is geographically load balanced? Is a few hours of downtime a year due to S3 going to break a lot of these companies?

I'm thinking it would rarely be worth the expense - doubling the requirements needed, basically doubling their costs.

I suppose if they could get away with a fraction of the size in a separate location, it might be worthwhile. But then the concern is, will the reduced size be able to handle the load, and if not, will it fail completely anyway because of the demand?

coliver

@Dashrender said in Amazon S3 Outage shows the danger of doing things cheaply.:

@coliver said in Amazon S3 Outage shows the danger of doing things cheaply.:

I would like to see the cost comparison. How much money did these companies lose due to this outage versus how much they would spend on a resilient infrastructure that is geographically load balanced? Is a few hours of downtime a year due to S3 going to break a lot of these companies?

I'm thinking it would rarely be worth the expense - doubling the requirements needed, basically doubling their costs.

I suppose if they could get away with a fraction of the size in a separate location, it might be worthwhile. But then the concern is, will the reduced size be able to handle the load, and if not, will it fail completely anyway because of the demand?

Potentially. AWS and S3 do offer some interesting load balancing that would be cheaper than double the price... still not sure it would be worth the expense though.

Carnival Boy

I disagree with the article. One of the main reasons I would move to a cloud service is to outsource my redundancy and resilience. I move from on-premise to Amazon precisely because they have the economies of scale and expertise to manage the infrastructure better than I can. So of course I can blame them if they fail, unless it's within their SLA (which I don't think they have?). If I have to start bringing the management of resilience and redundancy back in-house, then part of the point of cloud services disappears. It has nothing to do with cost.

Dashrender

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

I disagree with the article. One of the main reasons I would move to a cloud service is to outsource my redundancy and resilience. I move from on-premise to Amazon precisely because they have the economies of scale and expertise to manage the infrastructure better than I can. So of course I can blame them if they fail, unless it's within their SLA (which I don't think they have?). If I have to start bringing the management of resilience and redundancy back in-house, then part of the point of cloud services disappears. It has nothing to do with cost.

Well, you can definitely outsource all of those things, but just moving things to AWS or any cloud service doesn't make you instantly failure-proof. If you need a certain level of uptime, whoever is managing this for you, be it you or someone you hire, has to know your expectations and purchase accordingly.

Are you saying that you assume that simply by putting a VM (or actual cloud service) in AWS that you automatically assume you have full DC failover, etc? Why do you assume this?

Deleted74295

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

I disagree with the article. One of the main reasons I would move to a cloud service is to outsource my redundancy and resilience.

But you don't buy any of that from Amazon. This is the biggest misconception about cloud computing.

What you are buying is access to resources on their platform. You can buy resources in Asia, the US, or Europe, but it is completely up to you to design and manage these resources so they work according to your needs and are fit for purpose.

Amazon are indeed the best provider, but traditional wisdom still applies: good system design, failure/recovery planning, and weighing the cost of the potential risk against the spend to prevent it.

Deleted74295

@coliver said

Potentially. AWS and S3 do offer some interesting load balancing that would be cheaper than double the price... still not sure it would be worth the expense though.

AWS costs are a bit of a minefield, so for simplicity of math let's assume equal spend if you want copies.

S3 does indeed have geo distribution for the storage but that costs more money.

Also...
http://searchaws.techtarget.com/news/2240223024/Code-Spaces-goes-dark-after-AWS-cloud-security-hack

If all of your eggs are inside a single management portal, what happens if that 1 management portal gets breached?

coliver

@Breffni-Potter said in Amazon S3 Outage shows the danger of doing things cheaply.:

@coliver said

Potentially. AWS and S3 do offer some interesting load balancing that would be cheaper than double the price... still not sure it would be worth the expense though.

AWS costs are a bit of a minefield, so for simplicity of math let's assume equal spend if you want copies.

S3 does indeed have geo distribution for the storage but that costs more money.

Also...
http://searchaws.techtarget.com/news/2240223024/Code-Spaces-goes-dark-after-AWS-cloud-security-hack

If all of your eggs are inside a single management portal, what happens if that 1 management portal gets breached?

But that still doesn't really answer the question. What is the cost of building out that resiliency, and does it actually benefit the company? My guess is that, in a lot of cases, building resiliency costs more than the downtime would. The Code Spaces issue was entirely on them. Not having a decent backup system isn't the fault of the cloud, AWS, or the management engine.

Deleted74295

@coliver It depends. In this article I just looked at S3 storage alone, without any bandwidth charges.

What is the app? How much storage, how much processing power and memory, how much bandwidth, how many users, and which geographic locations do they need to serve? How complex is the app or service to install, update, and deploy?

It's like any IT project: what is the risk of this happening, what is the cost of prevention, and then making an informed decision. The same thought process and decision making that goes into a single-server or dual-server setup on prem still applies when we look at cloud computing.

coliver

@Breffni-Potter said in Amazon S3 Outage shows the danger of doing things cheaply.:

@coliver It depends. In this article I just looked at S3 storage alone, without any bandwidth charges.

What is the app? How much storage, how much processing power and memory, how much bandwidth, how many users, and which geographic locations do they need to serve? How complex is the app or service?

It's like any IT project: what is the risk of this happening, what is the cost of prevention, and then making an informed decision. The same thought process and decision making that goes into a single-server or dual-server setup on prem still applies when we look at cloud computing.

Agreed completely. Which is why I bring it up. This may not be an issue of companies being cheap or doing the cheapest thing for the sake of expense. This is likely companies being fiscally responsible and choosing the best option with the greatest return. S3, and AWS, have been ridiculously reliable, most times having only a few hours of unplanned outages a year, and even fewer planned outages. I agree companies can't place the blame on AWS when things go down, but they can say our infrastructure is having issues due to an AWS outage.

scottalanmiller

@coliver said in Amazon S3 Outage shows the danger of doing things cheaply.:

I would like to see the cost comparison. How much money did these companies lose due to this outage versus how much they would spend on a resilient infrastructure that is geographically load balanced? Is a few hours of downtime a year due to S3 going to break a lot of these companies?

That's missing a HUGE element, which is the risk. Just because you have a bad outage doesn't mean that it was worth protecting against. You have to consider how likely it was to happen, regardless of whether or not it did happen.
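
To put the risk point in numbers, here is a minimal sketch of the comparison, assuming the usual "expected loss" model (how often it happens x how long it lasts x what an hour costs you). The function names and every figure below are made up for illustration; none of them come from the article or from AWS.

```python
def expected_annual_loss(outages_per_year, hours_per_outage, loss_per_hour):
    """Expected yearly cost of downtime: frequency x duration x hourly loss."""
    return outages_per_year * hours_per_outage * loss_per_hour


def worth_protecting(outages_per_year, hours_per_outage, loss_per_hour,
                     annual_resilience_cost):
    """True when the expected yearly loss exceeds the yearly cost of protection."""
    loss = expected_annual_loss(outages_per_year, hours_per_outage, loss_per_hour)
    return loss > annual_resilience_cost


# Hypothetical figures: one 4-hour outage every two years, $10,000 lost per hour,
# and $120,000 a year for a second, geographically separate deployment.
print(expected_annual_loss(0.5, 4, 10_000))       # 20000.0
print(worth_protecting(0.5, 4, 10_000, 120_000))  # False
```

With those invented numbers the protection costs six times the expected loss, so you would accept the risk. Change the likelihood or the hourly loss and the answer can flip, which is why the outage having happened tells you little on its own.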

scottalanmiller

@Dashrender said in Amazon S3 Outage shows the danger of doing things cheaply.:

@coliver said in Amazon S3 Outage shows the danger of doing things cheaply.:

I would like to see the cost comparison. How much money did these companies lose due to this outage versus how much they would spend on a resilient infrastructure that is geographically load balanced? Is a few hours of downtime a year due to S3 going to break a lot of these companies?

I'm thinking it would rarely be worth the expense - doubling the requirements needed, basically doubling their costs.

I suppose if they could get away with a fraction of the size in a separate location, it might be worthwhile. But then the concern is, will the reduced size be able to handle the load, and if not, will it fail completely anyway because of the demand?

Yup. Big companies normally consider the risks versus the costs and determine if protecting against something makes sense.

coliver

@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:

@coliver said in Amazon S3 Outage shows the danger of doing things cheaply.:

I would like to see the cost comparison. How much money did these companies lose due to this outage versus how much they would spend on a resilient infrastructure that is geographically load balanced? Is a few hours of downtime a year due to S3 going to break a lot of these companies?

That's missing a HUGE element, which is the risk. Just because you have a bad outage doesn't mean that it was worth protecting against. You have to consider how likely it was to happen, regardless of whether or not it did happen.

I get that. I would like to see the numbers for the most recent outage though, for purely academic reasons. I just think it would be interesting. Ignoring the risk entirely for a moment, my guess is that having the infrastructure for a year to protect against this unlikely event would still have cost more than the downtime itself.

scottalanmiller

@coliver said in Amazon S3 Outage shows the danger of doing things cheaply.:

@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:

@coliver said in Amazon S3 Outage shows the danger of doing things cheaply.:

I would like to see the cost comparison. How much money did these companies lose due to this outage versus how much they would spend on a resilient infrastructure that is geographically load balanced? Is a few hours of downtime a year due to S3 going to break a lot of these companies?

That's missing a HUGE element, which is the risk. Just because you have a bad outage doesn't mean that it was worth protecting against. You have to consider how likely it was to happen, regardless of whether or not it did happen.

I get that. I would like to see the numbers for the most recent outage though, for purely academic reasons. I just think it would be interesting. Ignoring the risk entirely for a moment, my guess is that having the infrastructure for a year to protect against this unlikely event would still have cost more than the downtime itself.

Very easily for sure.

Carnival Boy

@Dashrender said in Amazon S3 Outage shows the danger of doing things cheaply.:

Are you saying that you assume that simply by putting a VM (or actual cloud service) in AWS that you automatically assume you have full DC failover, etc? Why do you assume this?

I don't know what you mean "full DC failover"? I would assume I'd have uptime within the SLA or within published expectations of uptime, which in Amazon's case is about 100% I believe?

scottalanmiller

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

@Dashrender said in Amazon S3 Outage shows the danger of doing things cheaply.:

Are you saying that you assume that simply by putting a VM (or actual cloud service) in AWS that you automatically assume you have full DC failover, etc? Why do you assume this?

I don't know what you mean "full DC failover"? I would assume I'd have uptime within the SLA or within published expectations of uptime, which in Amazon's case is about 100% I believe?

A bit below 100%. Their uptime is from using multiple data centers.
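
For a sense of what "a bit below 100%" means in hours, and why multiple data centers help, here is a rough sketch. It assumes sites fail independently of each other, which real outages often violate, and the availability figures are illustrative, not Amazon's published numbers.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability):
    """Hours of downtime per year implied by an availability fraction (0..1)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(availability, sites):
    """Availability of N redundant sites, assuming independent failures.

    The service is down only when every site is down at the same time.
    """
    return 1.0 - (1.0 - availability) ** sites


# Illustrative values only.
print(round(downtime_hours_per_year(0.999), 2))   # 8.76
print(round(combined_availability(0.99, 2), 6))   # 0.9999
```

So "three nines" still allows most of a working day of downtime per year, and a second independent site takes two nines to four, but only if something actually fails over to it.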

travisdh1

Finally a follow-up article. Apparently they just rebooted too many machines at the same time. Human error, is anyone surprised?

scottalanmiller

@travisdh1 said in Amazon S3 Outage shows the danger of doing things cheaply.:

Finally a follow-up article. Apparently they just rebooted too many machines at the same time. Human error, is anyone surprised?

Always something simple.

Dashrender

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

@Dashrender said in Amazon S3 Outage shows the danger of doing things cheaply.:

Are you saying that you assume that simply by putting a VM (or actual cloud service) in AWS that you automatically assume you have full DC failover, etc? Why do you assume this?

I don't know what you mean "full DC failover"? I would assume I'd have uptime within the SLA or within published expectations of uptime, which in Amazon's case is about 100% I believe?

DC as in datacenter failover, i.e. this DC is offline for whatever reason, so now your data/services are running in another DC.
Even if it is listed at 100%, the SLA just gives them an out when they don't meet it, i.e. they get to send you a check, that's it. Nothing more. I wouldn't expect them to clone your data to another DC in real time unless you're paying for that feature, at which point your SLA would be even higher, and you're paying a TON more.
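
As a rough illustration of why the check doesn't make you whole, here is a small sketch comparing an SLA credit to the actual loss. It assumes the credit is a fraction of the monthly bill; the 10% figure and all dollar amounts are hypothetical and not taken from any real SLA.

```python
def sla_credit(monthly_bill, credit_fraction):
    """Service credit paid back when the SLA is missed (fraction of the bill)."""
    return monthly_bill * credit_fraction


def uncovered_loss(outage_hours, loss_per_hour, monthly_bill, credit_fraction):
    """Business loss left over after the SLA credit is applied."""
    business_loss = outage_hours * loss_per_hour
    return business_loss - sla_credit(monthly_bill, credit_fraction)


# Hypothetical: $2,000 monthly bill, 10% credit, 4-hour outage at $10,000/hour.
print(sla_credit(2_000, 0.10))
print(uncovered_loss(4, 10_000, 2_000, 0.10))
```

With those invented numbers the credit is a couple of hundred dollars against tens of thousands in losses. The credit scales with what you pay the provider, not with what the outage costs you.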