Crisis: Chaos to Resolution

October 14th, 2015

Crisis situations tend to be great breeding grounds for confusion and chaos. The good news is that it is fairly easy to stop this natural state: all it takes is a leader and a sound, tested crisis management plan.

In this post, I share a simple high-level methodology to handle a number of crisis situations. The model consists of breaking the situation up into phases with distinct goals for each phase. In the real world, it is pretty common to have to go back to phases that were already completed as more information is found, especially when there are multiple issues. For each phase, there are two themes: the issue(s) and communication.

1. Identify

This is the first phase and marks the start of the crisis management effort. The first step is to do a quick triage and to determine what additional skills are needed. This initial scoping will enable the Incident Commander to determine who needs to be engaged to help isolate the issue further.

Because the issue has now been identified, it is easy to notify customers that there is a known issue and that the team is engaged.

2. Isolate

Based on experience, this is the step that requires the highest level of discipline, because of the common tendency for folks to try to solve the issue(s) instead of isolating them. Isolating the issue is important because it lays the groundwork for stopping the issue from occurring in the future.

In this phase, the team needs to isolate the issue(s) down to the granular component(s) that is/are causing the issue(s). As more information is gathered, it enables the Incident Commander to engage or release the appropriate Subject Matter Experts (SME) on the team.

When it comes to customer communication, I prefer to communicate the actual current state until the issue(s) are isolated, even if it is not possible to provide an estimate of when the situation will be resolved. Therefore, when entering the Isolate phase, I like to notify customers that the team is engaged and starting to isolate the issue(s). As progress is made, provide progress updates to customers. However, when progress is slow, switch to a timed update methodology and provide customers with updates at a predetermined interval. There are pros and cons to each of these approaches, so you will need to select the mechanism that best suits your business needs.
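The switch between progress-based and timed updates can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `update_due`, its parameters, and the 30-minute default interval are hypothetical and should be replaced by whatever your own procedure defines.

```python
from datetime import datetime, timedelta

def update_due(now, last_update, progress_since_update,
               interval=timedelta(minutes=30)):
    """Decide whether a customer update should go out now.

    Hypothetical rule: send an update as soon as there is progress to
    report; otherwise fall back to a timed update once `interval` has
    passed since the last update.
    """
    if progress_since_update:
        return True  # progress-based update
    return now - last_update >= interval  # timed update

# Example: no progress for 35 minutes, so a timed update is due.
last = datetime(2015, 10, 14, 9, 0)
print(update_due(datetime(2015, 10, 14, 9, 35), last, False))
```

The point of the sketch is that customers never wait longer than the agreed interval, whether or not the team has news.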

3. Restore

Now come the steps to mitigate or restore the situation to “normal”. The million-dollar question is: what is normal? “Normal” is best defined before the crisis, and I prefer a checklist that can be used to determine whether the situation has returned to a fully functioning state.

Back to customer communication: with the checklist, you can tell customers that you are on step 3 of 9, or whatever the counts are. The expected duration of each step should be collected during test runs or from other similar crisis situations.
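As a rough sketch of how such a checklist could drive the customer message, the Python below turns a list of restore steps into a "step N of M" status with a time estimate. The `Step` class, the `progress_message` function, the step names, and the durations are all hypothetical; real durations would come from your own test runs.

```python
from dataclasses import dataclass

@dataclass
class Step:
    name: str
    expected_minutes: int  # hypothetical value, measured during test runs

def progress_message(steps, completed):
    """Build a customer-facing status line from the restore checklist.

    `completed` is the number of checklist steps already finished.
    """
    total = len(steps)
    if not 0 <= completed <= total:
        raise ValueError("completed must be between 0 and the number of steps")
    if completed == total:
        return f"Restore complete: all {total} steps verified."
    # Remaining time includes the step currently being worked on.
    remaining = sum(s.expected_minutes for s in steps[completed:])
    return (f"Restore in progress: on step {completed + 1} of {total}, "
            f"about {remaining} minutes remaining.")

checklist = [
    Step("Fail over database", 5),
    Step("Restart application tier", 10),
    Step("Verify customer logins", 15),
]
print(progress_message(checklist, 1))
```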

4. Repair

This step is where the issue is repaired if the root cause was not addressed in the Restore step. This step also covers removing any mitigations that were put in place and returning the service to a fully operational state.

Remember that some customers would like to know when this step has been completed.

5. Eliminate

The Best Customer Impacting Incident is the One that didn’t happen!! To make this statement come true, the team needs to learn as much as possible from every crisis situation and apply those lessons to possible future states. So it is important to analyze the situation and to define action items, each with a clear owner. Over time, this step ensures that the situation is eliminated. I prefer to use a formal retrospective review of the crisis situation. This process is very similar to the typical ITIL post mortem but is more collaborative and integrates much better in an agile or continuous operational state. I will be publishing a blog post on a very effective retrospective technique within the next month or so.

Ideally, the information gleaned from the retrospective can be used to formulate a customer communication. Unfortunately, this is not always possible. In that case, still provide customers with some type of report so that they do not have to speculate and operate in the dark, then continue to perform the due diligence in the background. This enables a data-driven follow-up with customers, should it be needed.

Crisis Leadership

October 11th, 2015

Leadership always matters, and in a crisis situation leadership matters even more! When running a service, one of the most critical times is when the service fails and someone needs to step up and take the lead. Without someone taking the reins, I have seen a myriad of situations arise, and as such I am sharing a very high-level definition of the key focus areas for the crisis leadership role, often called an Incident Commander or Crisis Manager.

Assemble the team

The very first responsibility of this role is to assemble the team with the skills needed to restore the service ASAP. If the problem shifts or there are multiple issues, it might be necessary to adjust the composition of the team to ensure that the skills needed are available. In a long-running situation, it might also be necessary to perform shift changes, including for the Incident Commander.

Communicate, Communicate, Communicate

A service without customers will not last for long, and as such it is imperative that the situation be communicated in a clear and concise manner on a regular and predictable rhythm. My preferred communication rhythm is an update every 15 or 30 minutes, and the rhythm needs to be defined as part of the Standard Operating Procedure (SOP).

Maintain Focus

Ensure that the team maintains the focus needed to restore the service. It is my experience that engineers who don’t know what to do after 5 minutes of thought time will still not know what to do after 25 minutes of thought. In that case, bring in another engineer who is able to operate more effectively under pressure. I would suggest that the Incident Commander operate according to a predefined process for handling situations where slow or no progress is being made. This enables the Incident Commander to engage other engineers to help expedite things as part of the SOP, without making it personal. This is one of the most difficult tasks an employee can be asked to perform, because most folks do not want to wake up others in the middle of the night.

Quality

Do a job right or don’t bother! This is really easy to say yet extremely difficult to adhere to in a crisis situation. It boils down to Leadership and someone flying this flag. In most cases people will follow the lead.

Urgency

The best idea tomorrow really doesn’t help us solve today’s crisis situation. I am a big fan of using countdown timers to create the stimuli needed to engage additional team members or escalate to executives.
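One possible way to make such countdown timers concrete is an escalation ladder keyed on time without progress, sketched below in Python. The ladder, its 15- and 45-minute thresholds, and the function names are assumptions made for illustration; the real values belong in your SOP.

```python
from datetime import timedelta

# Hypothetical thresholds; define the real ones in the SOP.
ESCALATION_LADDER = [
    (timedelta(minutes=15), "engage an additional engineer"),
    (timedelta(minutes=45), "escalate to executives"),
]

def actions_due(time_without_progress, already_taken=()):
    """Return escalation actions whose countdown has expired."""
    return [action for threshold, action in ESCALATION_LADDER
            if time_without_progress >= threshold
            and action not in already_taken]

def next_escalation(time_without_progress):
    """Return (time remaining, action) for the next rung, or None."""
    for threshold, action in ESCALATION_LADDER:
        if time_without_progress < threshold:
            return threshold - time_without_progress, action
    return None

stalled = timedelta(minutes=20)
print(actions_due(stalled))      # actions whose timer has run out
print(next_escalation(stalled))  # countdown to the next rung
```

Because the thresholds are written down in advance, engaging another engineer or an executive becomes a step in the process rather than a personal judgment made under pressure.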


© 2008-2017 Gavin McMurdo aka SparkPilot All Rights Reserved