AWS Catastrophic Data Loss

IT Discussion
    • PhlipElder @dafyre

      @dafyre said in AWS Catastrophic Data Loss:

      @Dashrender said in AWS Catastrophic Data Loss:

      @dafyre said in AWS Catastrophic Data Loss:

      @Pete-S said in AWS Catastrophic Data Loss:

      @dafyre said in AWS Catastrophic Data Loss:

      @Pete-S said in AWS Catastrophic Data Loss:

      Message From Amazon AWS:

      Update August 28, 2019 JST:

      That is how a post-mortem write-up should look. It's got details, and they know beyond reasonable doubt what actually happened...

      It reads like Lemony Snicket's Series of Unfortunate Events, though, lol.

      Yes, it does. I'm familiar with systems of that type and the problems Amazon experienced were "rookie" mistakes. Imagine a chemical plant or even worse, a nuclear plant that made design mistakes like that. Outcome would have been a little worse than some ones and zeros getting lost.

      It sure seems like rookie mistakes, doesn't it? Their system is complex, and they faced problems along every step of the way. It seems like the biggest mistake here is not actually testing procedures.

      eh? It seemed like they did test; they just never had a failure like this in the past. Not saying there isn't room for improvement.

      The blurb that @Pete-S shared doesn't say much about their testing procedures, so you could well be right. I've definitely seen cascading failures like this before, where it seems like all of the 'safeties' you had in place fail one after another until things finally shut down.

      That's the inherent problem with hyper-scale systems. There is no way to fully test resilience. None. Nada. Zippo. Zilch.

      It's all fly-by-the-seat-of-the-pants theory until the sh#t described above happens.
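The "safeties failing one after another" pattern has a simple arithmetic core: independent safety layers multiply their failure probabilities, and a cascade is precisely the situation where that independence assumption breaks. A rough sketch with illustrative numbers (the 1% per-layer figure is invented, not from the AWS report):

```python
# Joint failure probability of n safety layers, assuming each layer
# fails independently with probability p.
def all_layers_fail(p: float, n: int) -> float:
    return p ** n

p = 0.01  # illustrative per-layer failure probability (made up)
for n in (1, 2, 3):
    print(f"{n} layer(s): {all_layers_fail(p, n):.0e}")

# The catch: a common trigger (e.g. one power event) correlates the
# layers. If every layer shares the trigger, the joint probability
# collapses back toward p rather than p**n, which is exactly what a
# cascading failure looks like from the outside.
```

The independence assumption is what design reviews buy you; the correlated case is what post-mortems like Amazon's describe.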

    • 1337 @PhlipElder

      @PhlipElder said in AWS Catastrophic Data Loss:

      That's the inherent problem with hyper-scale systems. There is no way to fully test resilience. None. Nada. Zippo. Zilch.

      I would argue that making something fail-safe is not the problem. More likely, they didn't think it was important enough to invest the time and money in their hyperscale datacenter to make sure it wouldn't fail on nonsense like this. After all, the goal is to make money, not to spend more than needed.

      The technology and knowledge exist, because they're used in other industries where failures result in death and catastrophe.

    • dafyre @1337

      @Pete-S said in AWS Catastrophic Data Loss:

      I would argue that making something fail-safe is not the problem. More likely, they didn't think it was important enough to invest the time and money to make sure it wouldn't fail on nonsense like this.

      The technology and knowledge exist, because they're used in other industries where failures result in death and catastrophe.

      I agree with you here, somewhat. There is no such thing as "won't fail"; there's always a chance of failure. The more money you can throw at it, the less likely a cascade of failures is to happen.

      Having worked in IT a long time, I can count on one hand the number of times I've seen PLCs fail in similar situations. But they are electronics, and they're not going to be 100% reliable when there are voltage spikes and power brownouts.

      Their goal is definitely to make money, but they also have to spend enough to protect their reputation when stuff like this does happen (and it will).

    • dafyre @PhlipElder

      @PhlipElder said in AWS Catastrophic Data Loss:

      That's the inherent problem with hyper-scale systems. There is no way to fully test resilience. None. Nada. Zippo. Zilch.

      There is a way to fully test resiliency at hyper-scale. But to @Pete-S 's comment about money, there's a cost to that kind of planning, testing, improving, and retesting. A company whose goal is purely to make as much money as humanly possible is probably not going to put enough money into a FULL resiliency test, but it can be done...

    • BRRABill @Dashrender

      @Dashrender said in AWS Catastrophic Data Loss:

      What I'm saying is that that is USELESS to end users/corporate customers...

      I've been arguing that for years.

      because the chances that MS's DC is going to blow up is extremely small

      And yet, it is what this thread is about ... exactly that happening.

    • Dashrender @BRRABill

      @BRRABill said in AWS Catastrophic Data Loss:

      And yet, it is what this thread is about ... exactly that happening.

      Except that it's Amazon, not MS.

    • dafyre @Dashrender

      @Dashrender said in AWS Catastrophic Data Loss:

      Except that it's Amazon, not MS.

      Same difference, though. Think about the number of issues that some folks have with Microsoft.

    • DustinB3403 @dafyre

      @dafyre said in AWS Catastrophic Data Loss:

      Same difference, though. Think about the number of issues that some folks have with Microsoft.

      Paging @scottalanmiller !!!

    • PhlipElder @Dashrender

      @Dashrender said in AWS Catastrophic Data Loss:

      Except that it's Amazon, not MS.

      MS had a US Central outage this year or late last.

      MS was down worldwide when their authentication mechanism failed, I think a year or so ago.

      MS had Europe offline with VMs hosed and a recovery needed; that took weeks.

      MS has had plenty of trials by fire.

      Not one of the hyper-scale folks is trouble free.

      Most of our clients have had 100% up-time across solution sets for years, and in some cases we're coming up on decades. Cloud can't touch that. Period.

    • dbeato @PhlipElder

      @PhlipElder said in AWS Catastrophic Data Loss:

      Most of our clients have had 100% up-time across solution sets for years, and in some cases we're coming up on decades. Cloud can't touch that. Period.

      And no updates, correct? To have 100% up-time you must never do updates.

    • DustinB3403 @dbeato

      @dbeato said in AWS Catastrophic Data Loss:

      And no updates, correct? To have 100% up-time you must never do updates.

      Exactly. Even the global stock markets have downtime for maintenance and patching.

    • PhlipElder @dbeato

      @dbeato said in AWS Catastrophic Data Loss:

      And no updates, correct? To have 100% up-time you must never do updates.

      In a cluster setting, not too difficult. In this case, 100% up-time is defined as nary a user being impacted by any service or app being offline when needed.

      So, point of clarification conceded.
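The cluster claim (updates with nary a user impacted) rests on draining and patching one node at a time while the remaining nodes carry the load. A toy sketch of that rolling pattern; the node names and capacity threshold are invented for illustration:

```python
# Toy rolling update: take nodes offline one at a time and verify the
# cluster keeps at least `min_up` nodes serving throughout.
def rolling_update(nodes: list[str], min_up: int) -> list[int]:
    up = set(nodes)
    capacity_log = []
    for node in nodes:
        up.remove(node)               # drain and patch this node
        capacity_log.append(len(up))  # nodes still serving users
        if len(up) < min_up:
            raise RuntimeError("rolling update would cause an outage")
        up.add(node)                  # patched node rejoins the cluster
    return capacity_log

print(rolling_update(["node-a", "node-b", "node-c"], min_up=2))  # [2, 2, 2]
```

With three nodes and a two-node minimum, serving capacity never drops below what users need; with `min_up=3` the same update would necessarily be a user-visible outage, which is the distinction being conceded above.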

    • DustinB3403 @PhlipElder

      @PhlipElder said in AWS Catastrophic Data Loss:

      Most of our clients have had 100% up-time across solution sets for years and in some cases we're coming up on decades.

      Really, decades of uptime? Not a single bad RAM module, RAID failure, CPU, PSU, or motherboard issue? No site issues (fire, earthquake, tornado, etc.) in all that time?

      @PhlipElder said in AWS Catastrophic Data Loss:

      Cloud can't touch that. Period.

      You're full of it.

    • PhlipElder @DustinB3403

      @DustinB3403 said in AWS Catastrophic Data Loss:

      You're full of it.

      I'm quite proud of our record. It's a testament to the amount of time and money put into researching, proofing, and thrashing the solution sets we've sold over the years. We don't sell anything we don't first proof.

    • DustinB3403 @PhlipElder

      @PhlipElder said in AWS Catastrophic Data Loss:

      We don't sell anything we don't first proof.

      So you're using technology that is at least a decade old for every one of your customers, because by your own word you can't possibly have had the time to proof anything from this year and already sold it to a customer!

    • dbeato @PhlipElder

      @PhlipElder said in AWS Catastrophic Data Loss:

      In a cluster setting, not too difficult. In this case, 100% up-time is defined as nary a user being impacted by any service or app being offline when needed.

      Yes, I know you could do a cluster; that's how cloud providers give you that 99.9% up-time SLA. Right now it is hard to believe no one has any issues: if cloud providers at large scale have issues, then smaller companies have them as well. That said, no cloud provider provides backups for anyone unless you set them up, either through their offering or your own company.
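That 99.9% SLA translates directly into a downtime budget. Quick arithmetic on what each extra nine allows per year:

```python
# Downtime allowed per year by an availability SLA.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability: float) -> float:
    return (1.0 - availability) * MINUTES_PER_YEAR

for sla in (0.999, 0.9999, 0.99999):
    print(f"{sla * 100:.3f}% -> {downtime_minutes(sla):7.1f} minutes/year")
```

Three nines is roughly 8.8 hours of allowed downtime a year; the "100% up-time" being claimed upthread allows zero, which is why the definition (no user-visible impact, rather than no component failures) matters.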

    • PhlipElder @DustinB3403

      @DustinB3403 said in AWS Catastrophic Data Loss:

      So you're using technology that is at least a decade old for every one of your customers, because by your own word you can't possibly have had the time to test anything from this year and sold it to a customer!

      Not sure how that conclusion came about, but far from it.

      We've had plenty of NDAs over the years to proof upcoming tech, so that we're on the right page and current.

    • DustinB3403 @PhlipElder

      @PhlipElder said in AWS Catastrophic Data Loss:

      We've had plenty of NDAs over the years to proof upcoming tech, so that we're on the right page and current.

      You've said you've tested everything that you sell. How does that square with claims of decades' worth of up-time? Power supplies fail, switches die, disks die, motherboards die, sites lose power (and people still have jobs to do even when the lights are out...).

      So you're still full of it. Not to mention that performing any update will eventually require a restart: Windows updates, file server migrations, etc. all require some downtime.

    • Dashrender @DustinB3403

      @DustinB3403 said in AWS Catastrophic Data Loss:

      Power supplies fail, switches die, disks die, motherboards die, sites lose power.

      All of those things can fail, as long as there's an HA solution that accounts for those failures.

      As he said earlier, the customer has NEVER been impacted; that's the point of measurement.

    • IRJ

      Adding this graphic again...

      The data is on the customer!

      [image attachment]
