A Public Post Mortem of An Outage
-
$10K isn't exactly chump change, but it's not really a lot either. A 3-day outage and it didn't equate to $10K in losses? What was on that server?
-
@Dashrender said in A Public Post Mortem of An Outage:
$10K isn't exactly chump change, but it's not really a lot either. A 3-day outage and it didn't equate to $10K in losses? What was on that server?
System downtime does not directly translate into a complete loss of revenue, as most businesses try to claim. It is more often related to a slower revenue stream, which significantly extends the time that things can be down.
-
Of course I understand that. Each business is different.
When we self-hosted our EHR, if it was down for a day, we could literally have to cancel clinics until it was fixed. While many of those patients would be rescheduled, we'd be paying staff to be onsite cleaning up, etc., and those costs add up while no income is coming in.
-
@Dashrender said in A Public Post Mortem of An Outage:
$10K isn't exactly chump change, but it's not really a lot either. A 3-day outage and it didn't equate to $10K in losses? What was on that server?
It was more than $10K in losses; $10K is how much cheaper it was to take the loss rather than to pay to mitigate it.
Most SMBs can have their servers down for a bit without major impact. AD, for example, will have near zero impact on a normal business because of cached creds.
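To put rough numbers on the "cheaper to take the loss" point, here's a minimal sketch of the trade-off. The absolute figures are made-up assumptions; only the $10K difference echoes what was said above.

```python
# Rough sketch of the "take the loss vs. pay to mitigate" decision.
# The dollar amounts below are hypothetical placeholders, not figures from the actual outage.

estimated_outage_loss = 15_000   # expected net loss from riding out the repair
cost_to_mitigate = 25_000        # e.g., replacing the gear outright instead of repairing it

savings_from_waiting = cost_to_mitigate - estimated_outage_loss
if savings_from_waiting > 0:
    print(f"Ride it out: taking the loss is ${savings_from_waiting:,} cheaper")
else:
    print(f"Mitigate now: it is ${-savings_from_waiting:,} cheaper than eating the outage")
```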
-
@JaredBusch said in A Public Post Mortem of An Outage:
@Dashrender said in A Public Post Mortem of An Outage:
$10K isn't exactly chump change, but it's not really a lot either. A 3-day outage and it didn't equate to $10K in losses? What was on that server?
System downtime does not directly translate into a complete loss of revenue, as most businesses try to claim. It is more often related to a slower revenue stream, which significantly extends the time that things can be down.
Exactly. They didn't have email on that server, nor phones, so their communications didn't go down. And AD was cached, so it wasn't impacted except for users moving from one desktop to another, which they don't do (or only very rarely), and all of their applications work offline. They were certainly impacted, but it definitely didn't bring the business to its knees, either.
-
@Dashrender said in A Public Post Mortem of An Outage:
When we self-hosted our EHR, if it was down for a day, we could literally have to cancel clinics until it was fixed. While many of those patients would be rescheduled, we'd be paying staff to be onsite cleaning up, etc., and those costs add up while no income is coming in.
A major factor for a lot of businesses is the rubber band effect of work - only companies already running with their production "backs against the wall" can't experience it. What happens is that the staff gets time to "rest" while nothing is happening. They might take time off, have a "lazy day", or catch up on other things... cleaning the office, rearranging the furniture, physical filing, whatever. The chance that this time has zero value is very low, almost nil. Then, when the systems return, they are better prepared to work more intensely and can often catch up either partially or fully. Rarely do you see a total productivity loss, and rarely a total recovery; normally it lands somewhere in between, as staff work faster and more productively to make up the lost time.
Since a normal business isn't doing as much work as it could possibly do (the only exceptions being those that don't do sales and can't take on new customers without more resources), it can normally catch up to some degree. This doesn't work for businesses like a 911 call center, of course. But a typical business can catch up, at least partially.
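As a rough illustration of that rubber band effect, here's a minimal sketch: the net hit is the deferred revenue minus whatever gets recovered once systems return. The daily revenue, outage length, and catch-up fraction are all hypothetical assumptions, not numbers from this outage.

```python
def effective_outage_cost(daily_revenue, outage_days, catch_up_fraction):
    """Net revenue impact of an outage when part of the deferred work is
    recovered after systems return (0.0 = total loss, 1.0 = full recovery)."""
    deferred = daily_revenue * outage_days
    recovered = deferred * catch_up_fraction
    return deferred - recovered

# Hypothetical example: $2,000/day, a 5-day outage, and 70% of the work caught up
# afterwards leaves roughly a $3,000 net hit, far less than the naive
# days-times-revenue figure of $10,000.
print(effective_outage_cost(2_000, 5, 0.7))
```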
-
Good topic.
One question, how long was the outage?
-
@DustinB3403 said in A Public Post Mortem of An Outage:
One question, how long was the outage?
Nearly a week.
-
@scottalanmiller said in A Public Post Mortem of An Outage:
@DustinB3403 said in A Public Post Mortem of An Outage:
One question, how long was the outage?
Nearly a week.
Wow, that is a rather long time.
-
@DustinB3403 said in A Public Post Mortem of An Outage:
Wow, that is a rather long time.
Yup, parts were very hard to get, and getting the server physically moved before diagnostics could even begin ate up huge amounts of time. The cost of speeding things up would have been huge - replacing the gear instead of repairing it. And since the vendor could not diagnose the hardware issue (their own error messages were ones they had no documentation for), that complicated things greatly.