Amazon S3 Outage shows the danger of doing things cheaply.

Deleted74295

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

I disagree with the article. One of the main reasons I would move to a cloud service is to outsource my redundancy and resilience.

But you don't buy any of that from Amazon. This is the biggest misconception about cloud computing.

What you are buying is access to resources on their platform. You can buy resources in Asia, the US, or Europe, but it is completely up to you to design and manage those resources so they work according to your needs and are fit for purpose.

Amazon are indeed the best provider, but traditional wisdom still applies: good system design, failure/recovery planning, and weighing the cost of the potential risk against the spend to prevent it.

Deleted74295

@coliver said

Potentially. AWS and S3 do offer some interesting load balancing that would be cheaper than double the price.... still not sure it would be worth the expense though.

AWS costs are a bit of a minefield, so for simplicity of math let's assume equal spend if you want copies.

S3 does indeed have geo distribution for the storage but that costs more money.

Also...
http://searchaws.techtarget.com/news/2240223024/Code-Spaces-goes-dark-after-AWS-cloud-security-hack

If all of your eggs are inside a single management portal, what happens if that one management portal gets breached?

coliver

@Breffni-Potter said in Amazon S3 Outage shows the danger of doing things cheaply.:

@coliver said

Potentially. AWS and S3 do offer some interesting load balancing that would be cheaper than double the price.... still not sure it would be worth the expense though.

AWS costs are a bit of a minefield, so for simplicity of math let's assume equal spend if you want copies.

S3 does indeed have geo distribution for the storage but that costs more money.

Also...
http://searchaws.techtarget.com/news/2240223024/Code-Spaces-goes-dark-after-AWS-cloud-security-hack

If all of your eggs are inside a single management portal, what happens if that one management portal gets breached?

But that still doesn't really answer the question. What is the cost of building out that resiliency, and does it actually benefit the company? My guess is that, in a lot of cases, building resiliency costs more than the downtime would. The Code Spaces issue was entirely on them. Not having a decent backup system isn't the fault of the cloud, AWS, or the management engine.

Deleted74295

@coliver It depends, in this article I just looked at S3 storage alone and without any bandwidth charges.

What is the app? How much storage, how much processing power and memory, how much bandwidth, how many users, and which geographic locations does it need to serve? How complex is the app or service to install, update, and deploy?

It's like any IT project: what is the risk of this happening, what is the cost of prevention, and then making an informed decision. The same thought process and decision making that goes into a single-server or dual-server setup on prem still applies when we look at cloud computing.

coliver

@Breffni-Potter said in Amazon S3 Outage shows the danger of doing things cheaply.:

@coliver It depends, in this article I just looked at S3 storage alone and without any bandwidth charges.

What is the app? How much storage, how much processing power and memory, how much bandwidth, how many users, and which geographic locations does it need to serve? How complex is the app or service?

It's like any IT project: what is the risk of this happening, what is the cost of prevention, and then making an informed decision. The same thought process and decision making that goes into a single-server or dual-server setup on prem still applies when we look at cloud computing.

Agreed completely. Which is why I bring it up. This may not be an issue of companies being cheap or doing the cheapest thing for the sake of expense. This is likely companies being fiscally responsible and choosing the best option with the greatest return. S3, and AWS, have been ridiculously reliable, in most years having only a few hours of unplanned outages, and even fewer planned outages. I agree companies can't place the blame on AWS when things go down, but they can say "our infrastructure is having issues due to an AWS outage."

scottalanmiller

@coliver said in Amazon S3 Outage shows the danger of doing things cheaply.:

I would like to see the cost comparison. How much money did these companies lose due to this outage vs. how much they would spend on a resilient infrastructure that is geographically load balanced? Is a few hours of downtime a year due to S3 going to break a lot of these companies?

That's missing a HUGE element, which is the risk. Just because you had a bad outage doesn't mean that it was worth protecting against. You have to consider how likely it was to happen, regardless of whether or not it did happen.
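
To make the risk point concrete, here is a rough sketch of the comparison in Python. Every number in it is made up for illustration, and the function names are hypothetical; it only shows the shape of the calculation (likelihood times impact, compared against the yearly cost of protection), not anyone's real figures.

```python
def expected_annual_outage_cost(outages_per_year, hours_per_outage, cost_per_hour):
    """Expected yearly loss: how often it happens x how long x what an hour costs."""
    return outages_per_year * hours_per_outage * cost_per_hour


def redundancy_pays_off(outages_per_year, hours_per_outage, cost_per_hour,
                        annual_redundancy_cost):
    """True when the expected loss avoided exceeds what the protection costs."""
    expected_loss = expected_annual_outage_cost(
        outages_per_year, hours_per_outage, cost_per_hour)
    return expected_loss > annual_redundancy_cost


# Hypothetical inputs: one 4-hour outage expected every 5 years,
# 10,000 lost per hour of downtime, 60,000 per year for a second region.
loss = expected_annual_outage_cost(0.2, 4, 10_000)
print(loss)
print(redundancy_pays_off(0.2, 4, 10_000, 60_000))
```

With inputs like these the second region costs far more than the expected loss, but the answer flips quickly if the hourly cost of downtime is much higher, which is why the likelihood has to be part of the math.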

scottalanmiller

@Dashrender said in Amazon S3 Outage shows the danger of doing things cheaply.:

@coliver said in Amazon S3 Outage shows the danger of doing things cheaply.:

I would like to see the cost comparison. How much money did these companies lose due to this outage vs. how much they would spend on a resilient infrastructure that is geographically load balanced? Is a few hours of downtime a year due to S3 going to break a lot of these companies?

I'm thinking it would rarely be worth the expense - doubling the requirements needed, basically doubling their costs.

I suppose if they could get away with a fraction of the size in a disparate location, it might be worthwhile. But then the concern is, will the reduced size be able to handle the load, and if not, will it fail completely anyway because of the demand?

Yup. Big companies normally consider the risks versus the costs and determine if protecting against something makes sense.

coliver

@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:

@coliver said in Amazon S3 Outage shows the danger of doing things cheaply.:

I would like to see the cost comparison. How much money did these companies lose due to this outage vs. how much they would spend on a resilient infrastructure that is geographically load balanced? Is a few hours of downtime a year due to S3 going to break a lot of these companies?

That's missing a HUGE element, which is the risk. Just because you had a bad outage doesn't mean that it was worth protecting against. You have to consider how likely it was to happen, regardless of whether or not it did happen.

I get that. I would like to see the numbers related to the most recent outage though, for purely academic reasons. I just think it would be interesting. Ignoring the risk entirely for a moment, my guess is that having the infrastructure for a year to protect against this unlikely event would still have cost more than the downtime itself.

scottalanmiller

@coliver said in Amazon S3 Outage shows the danger of doing things cheaply.:

@scottalanmiller said in Amazon S3 Outage shows the danger of doing things cheaply.:

@coliver said in Amazon S3 Outage shows the danger of doing things cheaply.:

I would like to see the cost comparison. How much money did these companies lose due to this outage vs. how much they would spend on a resilient infrastructure that is geographically load balanced? Is a few hours of downtime a year due to S3 going to break a lot of these companies?

That's missing a HUGE element, which is the risk. Just because you had a bad outage doesn't mean that it was worth protecting against. You have to consider how likely it was to happen, regardless of whether or not it did happen.

I get that. I would like to see the numbers related to the most recent outage though, for purely academic reasons. I just think it would be interesting. Ignoring the risk entirely for a moment, my guess is that having the infrastructure for a year to protect against this unlikely event would still have cost more than the downtime itself.

Very easily for sure.

Carnival Boy

@Dashrender said in Amazon S3 Outage shows the danger of doing things cheaply.:

Are you saying that simply by putting a VM (or actual cloud service) in AWS you automatically assume you have full DC failover, etc.? Why do you assume this?

I don't know what you mean by "full DC failover". I would assume I'd have uptime within the SLA or within published expectations of uptime, which in Amazon's case is about 100%, I believe?

scottalanmiller

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

@Dashrender said in Amazon S3 Outage shows the danger of doing things cheaply.:

Are you saying that simply by putting a VM (or actual cloud service) in AWS you automatically assume you have full DC failover, etc.? Why do you assume this?

I don't know what you mean by "full DC failover". I would assume I'd have uptime within the SLA or within published expectations of uptime, which in Amazon's case is about 100%, I believe?

A bit below 100%. Their uptime comes from using multiple data centers.
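
As a rough illustration of why multiple data centers push uptime toward (but never to) 100%, here is the textbook parallel-availability formula in Python. It assumes site failures are independent, which real outages often are not, and the 99.9% per-site figure is a made-up input, not an Amazon number.

```python
def combined_availability(single_site_availability, sites):
    """Availability of N sites where any one surviving site keeps you up.

    Assumes site failures are independent of each other.
    """
    if sites < 1:
        raise ValueError("need at least one site")
    return 1 - (1 - single_site_availability) ** sites


# Hypothetical per-site availability of 99.9%
for n in (1, 2, 3):
    print(n, combined_availability(0.999, n))
```

Each extra site multiplies the chance of everything being down at once by the single-site failure rate, so the total keeps climbing but never actually reaches 1.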

travisdh1

Finally a followup article. Apparently they just rebooted too many machines at the same time. Human error, is anyone surprised?

scottalanmiller

@travisdh1 said in Amazon S3 Outage shows the danger of doing things cheaply.:

Finally a followup article. Apparently they just rebooted too many machines at the same time. Human error, is anyone surprised?

Always something simple.

Dashrender

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

@Dashrender said in Amazon S3 Outage shows the danger of doing things cheaply.:

Are you saying that simply by putting a VM (or actual cloud service) in AWS you automatically assume you have full DC failover, etc.? Why do you assume this?

I don't know what you mean by "full DC failover". I would assume I'd have uptime within the SLA or within published expectations of uptime, which in Amazon's case is about 100%, I believe?

DC as in datacenter failover, i.e. this DC is offline for whatever reason, so now your data/services are running in another DC.
Even if it is listed at 100%, the SLA just gives them an out when they don't meet it, i.e. they get to send you a check, that's it. Nothing more. I wouldn't expect them to clone your data to another DC in real time unless you're paying for that feature, at which point your SLA would be even higher, and you're paying a TON more.

Carnival Boy

Right. So why would you think that I would think that if I put data in just one DC I would have DC failover? That doesn't make any sense.

Jimmy9008

I don't blame AWS at all. If you use a service like AWS, use it properly and build for DR or take the risk! Think about the systems you use, and build properly.

scottalanmiller

@Jimmy9008 said in Amazon S3 Outage shows the danger of doing things cheaply.:

I don't blame AWS at all. If you use a service like AWS, use it properly and build for DR or take the risk! Think about the systems you use, and build properly.

Oh exactly. It is what it is. Single DC dependency, and on a single service in that DC. AWS tells us what to do if we need higher reliability than that. They were within SLA, I believe. It's all good.

Jimmy9008

@scottalanmiller

Exactly. We host everything at HQ on site, with a colo in Essex for DR purposes. If HQ was lost, staff are screwed until we restore from backups BUT... customers (the important part) are not really affected at all. We keep a hot copy of our websites and databases running a day out of date (which the business are fine with) in the Essex colo. We then use 'the cloud' to manage the failover process, which is a cheap solution compared to multiple cloud DCs hosting everything.

We have one VM in Azure, and one in AWS. Both check that our websites hosted at HQ are available on HTTP/HTTPS every second or so. If the sites are not responding, the VMs use the Cloudflare API to point DNS for all our websites to the hot running copies in the colo that are a day out of date. Pretty fast. When tested, it takes seconds and we're back online from a customer perspective. Our test: unplug our gateway firewall and see what happens... easy.

Yeah, it can be better, but it meets our needs, and other than Cloudflare (which does go down) we have no single point of failure... We're happy with that risk.
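
For anyone wanting to picture the monitor, here is a minimal Python sketch of the idea, not our actual script. The names `site_is_up` and `FailoverMonitor` are hypothetical, and waiting for three failed checks in a row is an assumption added here to avoid flipping DNS on a single blip. The DNS change itself is left as a hook you pass in, since the real Cloudflare API call depends on your zone, record, and token.

```python
import urllib.error
import urllib.request


def site_is_up(url, timeout=2.0):
    """Return True if the URL answers with a non-error HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, refused connections, timeouts, and bad URLs.
        return False


class FailoverMonitor:
    """Flip DNS to the standby after several consecutive failed checks."""

    def __init__(self, check, point_dns_to_standby, failures_needed=3):
        self.check = check  # callable returning True while the primary is up
        self.point_dns_to_standby = point_dns_to_standby  # hypothetical DNS hook
        self.failures_needed = failures_needed  # assumed threshold
        self.consecutive_failures = 0
        self.failed_over = False

    def tick(self):
        """Run one check; return True if this tick triggered the failover."""
        if self.failed_over:
            return False
        if self.check():
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failures_needed:
            self.point_dns_to_standby()
            self.failed_over = True
            return True
        return False
```

In use you would build one monitor per site, something like `FailoverMonitor(lambda: site_is_up("https://www.example.com/"), update_dns)`, where `update_dns` is your own function, and call `tick()` once a second in a loop. Failing back to HQ is deliberately left out; that is probably better done by hand once you know the primary is healthy.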

Carnival Boy

@Breffni-Potter said:

@Carnival-Boy said in Amazon S3 Outage shows the danger of doing things cheaply.:

I disagree with the article. One of the main reasons I would move to a cloud service is to outsource my redundancy and resilience.

But you don't buy any of that from Amazon. This is the biggest misconception about cloud computing.

Clearly I have a misconception. I'm not an Amazon customer, but looking at their website, they say things like:
Designed for 99.999999999% durability and 99.99% availability of objects over a given year.
Designed to sustain the concurrent loss of data in two facilities.
Amazon S3 redundantly stores data in multiple facilities and on multiple devices within each facility.

All of this suggests to me that they are selling resilience. If I read "designed for 99.99%" and then only got 90% availability, would it be fair for Amazon to say "yeah, but that's your fault, we never sold you resilience"? I don't think so.

If the argument we're having is "you're not paying for 100% availability" then I agree with you. If your argument is "you're not paying for resilience" then I struggle to agree with you.
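
To put those percentages in perspective, here is a quick Python conversion from an availability figure to minutes of downtime per year. It assumes a 365-day year and is only arithmetic on the advertised availability number; it says nothing about what the SLA actually commits to. Note the eleven-nines figure is durability (not losing objects), which is a separate thing from availability (being able to reach them).

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # ignores leap years


def downtime_minutes_per_year(availability_percent):
    """Minutes of downtime per year that an availability figure allows."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


for pct in (99.99, 99.9, 90.0):
    print(f"{pct}% -> {downtime_minutes_per_year(pct):.1f} minutes/year")
```

So 99.99% works out to a bit under an hour a year, while 90% would be about 36.5 days, which is the gap I'm talking about.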

Jimmy9008

@Carnival-Boy Agree. Nice.