Crisis: Chaos to Resolution

Crisis situations tend to be great breeding grounds for confusion and chaos. The good news is that it is fairly easy to stop this natural state: all it takes is a leader and a sound, tested crisis management plan.

In this post, I share a simple, high-level methodology for handling a range of crisis situations. The model breaks the situation up into phases, each with distinct goals. In the real world, it is common to go back to phases that were already completed as more information is found, especially when there are multiple issues. Each phase has two themes: the issue(s) and communication.
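As a rough sketch of this phase model, the snippet below tracks which phase an incident is in and records when the team steps back to an earlier one. The phase names come from the headings that follow; the `Incident` class, its `move_to` method, and the `history` list are hypothetical names used only for illustration.

```python
from enum import IntEnum


class Phase(IntEnum):
    # Phase names taken from the headings in this post.
    IDENTIFY = 1
    ISOLATE = 2
    RESTORE = 3
    REPAIR = 4
    ELIMINATE = 5


class Incident:
    """Hypothetical tracker; not part of any real tool."""

    def __init__(self):
        self.phase = Phase.IDENTIFY
        self.history = [Phase.IDENTIFY]

    def move_to(self, phase):
        """Enter a phase and report whether this is a step back.

        Going back to an earlier phase is allowed, because new
        information or additional issues often force a revisit.
        """
        revisiting = phase < self.phase
        self.phase = phase
        self.history.append(phase)
        return revisiting
```

Keeping the full history, rather than only the current phase, makes it easy to see afterwards how often the team had to go back and why.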

1. Identify

This is the first phase and marks the start of the crisis management effort. The first step is to do a quick triage and to determine what additional skills are needed. This initial scoping will enable the Incident Commander to determine who needs to be engaged to help isolate the issue further.

Now that the issue has been identified, it is easy to notify customers that there is a known issue and that the team is engaged.

2. Isolate

Based on experience, this is the step that requires the highest level of discipline, because people commonly try to solve the issue(s) instead of isolating them. Isolating the issue matters because it lays the groundwork for stopping the issue from occurring in the future.

In this phase, the team needs to isolate the issue(s) down to the granular component(s) causing them. As more information is gathered, the Incident Commander can engage or release the appropriate Subject Matter Experts (SMEs) on the team.

When it comes to customer communication, I prefer to communicate the actual current state until the issue(s) are isolated, even if it is not possible to provide an estimate of when the situation will be resolved. Therefore, when entering the Isolate phase, I like to notify customers that the team is engaged and starting to isolate the issue(s). As progress is made, provide progress updates to customers. However, when progress is slow, switch to a timed-update approach and provide customers with updates at a predetermined interval. There are pros and cons to each of these approaches, so you will need to select the mechanism that best suits your business needs.
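The two update styles can be combined in a small rule of thumb. The sketch below is only an illustration: the function name `update_due` and the 30-minute default interval are assumptions, not values from any standard, and the interval should be whatever your business has agreed with its customers.

```python
from datetime import datetime, timedelta


def update_due(now, last_update, progress_made,
               interval=timedelta(minutes=30)):  # assumed default interval
    """Decide whether a customer update should go out now.

    Progress-based: send an update as soon as there is progress.
    Timed: if progress is slow, still send one per interval.
    """
    if progress_made:
        return True
    return now - last_update >= interval
```

For example, with the last update sent at 10:00 and no progress since, this returns `False` at 10:10 and `True` at 10:30.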

3. Restore

Now come the steps to mitigate or restore the situation to “normal”. The million-dollar question is: what is normal? “Normal” is best defined before the crisis, and I prefer a checklist that can be used to determine whether the situation has returned to a fully functioning state.

Back to customer communication: with the checklist, you can tell customers that you are on step 3 of 9, or whatever the counts are. The duration of each step should be collected during test runs or other similar crisis situations.
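To make the checklist idea concrete, here is a minimal sketch that turns a restore checklist into a customer-facing progress line. The `Step` class, the `status_message` function, and the message wording are hypothetical; the per-step durations are assumed to come from test runs or earlier incidents, as described above.

```python
from dataclasses import dataclass


@dataclass
class Step:
    name: str
    expected_minutes: int  # assumed to be measured in test runs
    done: bool = False


def status_message(steps):
    """Build a progress line such as 'on step 3 of 9'."""
    total = len(steps)
    pending = [i for i, s in enumerate(steps) if not s.done]
    if not pending:
        return f"Restore complete: all {total} checklist steps verified."
    current = pending[0]  # first step that is not yet done
    remaining = sum(steps[i].expected_minutes for i in pending)
    return (f"Restore in progress: on step {current + 1} of {total} "
            f"({steps[current].name}); roughly {remaining} minutes remaining.")
```

The remaining-time figure is only as good as the recorded durations, so treat it as an estimate when passing it on to customers.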

4. Repair

This step is where the issue is repaired if the root cause was not addressed in the Restore step. It also covers removing any mitigations that were put in place and returning to a fully operational state.

Remember that some customers would like to know when this step has been completed.

5. Eliminate

The best customer-impacting incident is the one that didn’t happen! To make this statement come true, the team needs to learn as much as possible from every crisis situation and apply those lessons to any possible future states. So it is important to analyze the situation and to define action items, each with a clear owner. This step will eventually ensure that the situation is eliminated. I prefer to use a formal retrospective review of the crisis situation. This process is very similar to the typical ITIL post mortem but is more collaborative and integrates much better in an agile or continuous operational state. I will be publishing a blog post on a very effective retrospective technique within the next month or so.

Ideally, the information gleaned from the retrospective can be used to formulate a customer communication. Unfortunately, this is not always possible. In that case, still provide customers with some type of report so that they do not have to speculate or operate in the dark, then continue to perform the due diligence in the background. This enables a data-driven follow-up with customers, should it be needed.


© 2008-2017 Gavin McMurdo aka SparkPilot All Rights Reserved