Making Sense out of Incident Metrics

When I first started in the world of Incident Management, my team was mostly focused on handling the response aspect of incidents, with little time spent on understanding the entire universe of them. At that time I wouldn’t have been able to tell you how many incidents we had, what parts of the org they impacted, or how long they lasted. As an organization, this meant we had no way to show the impact incidents were having on our technology organizations and on our end users, or where we should be focusing our efforts.

In order to get out of the cycle of fighting fires, we realized we needed to understand our incidents a bit more. By the next year we made an effort to keep track of our incidents, how they came to be (specifically their “root cause”), as well as other details.

Over the years my thoughts on incident metrics have evolved, from focusing on timing details and rankings of root causes to seeing them as a mechanism that can further learning from incidents. Today I think of it as a process where we:

1. capture “data” through the analysis of incidents, focusing on learning from them (gathering themes from interviews, review meetings, and larger discussions with leaders and experts),

2. periodically take a look at the metrics that leadership usually likes (e.g., MTTR and incident count) and dig into the context that the data gathered from incidents provides, and

3. present this in a digestible format to get buy-in for recommendations for future focus.

The value of metrics

The first time we presented complete incident data to leadership I felt powerful; I had evidence to support the changes I was proposing, and I could show the impact that a whole year of my life focused on these metrics had made. Because we had that data we were able to make strategic recommendations that led to improvements in how we do our work. In addition, having that data allowed us to get buy-in from leadership to make a number of changes, both in our technology and in our culture. We beefed up a number of tools and processes, our product teams started attending retrospectives and prioritizing action items, and, most importantly, everyone across the org had a better understanding of what went into “keeping the lights on”.

As this positive feedback loop continues, the issues we deal with become more complex and the solutions less clear-cut; and as our process evolves, so does the data we capture and present.

Evolving your metrics program

The thing about providing good data is that once someone realizes the value of metrics, they continue to expect them. That said, the metrics that made sense years ago (focusing on top root causes, or the teams involved) may no longer make sense in 2021. That doesn’t mean we were doing a bad job in the past, but with an evolving process, our data and (most importantly) how we present it must also evolve.

Without providing any context, saying that an organization with a number of complex systems had 60 incidents that lasted 45 minutes on average doesn’t tell us much about the org’s stability or the systems’ ability to recover.

On the other hand, while metrics such as MTTR or incident count may not tell the whole story, having those numbers can be a starting point for understanding our systems. In fact, I can utilize the often-requested incident metrics to set the direction for further macro analysis. I do this by thinking carefully about what data we want to capture and by being particular about how we present this information.

Capturing Data

The information you want to get out of specific incidents will vary depending on your organization. Some organizations will want to know the number of incidents the entire company deals with or which systems contributed to the longest incidents. An organization that is going through a DevOps transition may be interested in seeing the volume of incidents per team or service, and a company trying to move away from a holiday code freeze may want to see data around incidents in relation to the time of year. I believe in reviewing the data points we capture at least once a year and making any necessary changes; this is how many of us can make the evolution from “root cause” to “contributing factors”.

The data that I describe here is the result of learning from incidents, and gathering it is usually the hardest part of the process. Don’t keep track only of how long the event lasted; instead, use this as an opportunity to think through the interesting findings from your analysis that you want to compare across different incidents. As we complete incident analyses we have the chance to capture themes and learnings, and keeping track of these is incredibly useful.
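As a loose sketch (my own illustration under the assumption of simple Python-based tooling, not a prescribed format), an analyzed incident might be recorded along these lines, with the qualitative findings sitting right next to the timing details; all of the field names are hypothetical:

    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class IncidentRecord:
        """One analyzed incident: timing details alongside the qualitative findings."""
        identifier: str                       # hypothetical internal ID, e.g. "INC-1042"
        started_at: datetime
        resolved_at: datetime
        services_involved: list[str] = field(default_factory=list)
        contributing_factors: list[str] = field(default_factory=list)  # rather than a single "root cause"
        themes: list[str] = field(default_factory=list)                # learnings from interviews and reviews
        near_miss: bool = False               # resolved before end users were impacted

        @property
        def minutes_to_recovery(self) -> float:
            return (self.resolved_at - self.started_at).total_seconds() / 60

Keeping the themes and contributing factors next to the durations is what later lets the quantitative metrics point back to the qualitative learnings.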

Analyzing and Presenting the information

Whatever metrics you are asked for (by execs, tech and product leadership, and other stakeholders), it is important to provide context around them. This can be done in a presentation or an email, as long as you are able to expand upon the numbers. The metrics most people tend to expect are usually around time to recovery, as well as incident counts and impact. While in a vacuum these metrics don’t tell us much, we can use them as a way to set the direction of the analysis we do around our entire incident universe (after all, with the number of incidents we and most technology organizations have, it is hard to share all of the learnings in one 45-minute presentation).

For example, when looking at a metric such as Mean Time to Recovery (MTTR), we want to explain why the number is what it is:

  • We can begin by looking for any outliers. Did your organization go through exceptional incidents that look completely different from the others? When presenting the data around MTTR I would make sure to highlight the themes we arrived at during the analysis of those incidents: how we prepare for a catastrophic failure in our systems, how we can empower our engineers to react under pressure, and what kind of recurring maintenance our systems need. (A rough sketch of this kind of outlier check follows this list.)
  • Outside of the outliers, we can see which incidents take the longest and shortest time to resolve and try to find commonalities among them. We can see what tools our SMEs used to resolve these incidents, how we came to realize this was a Production Incident that needed to be resolved right away, or how the number of teams involved reflected the complexity of the system impacted. This can help us guide teams on what tools to adopt or what signals to look out for in the future. The themes we captured throughout the individual incidents will have a chance to shine here. There will be common themes where the impact was felt across different parts of the org, so one team may not be able to do much on its own to improve upon it, but looking at this from a holistic point of view allows us to prioritize certain areas of focus.
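As a minimal sketch of the outlier check mentioned above, assuming we have each incident’s recovery time in minutes, MTTR can be computed and the incidents that deserve their own story flagged with a simple interquartile-range rule (the 1.5×IQR threshold is my assumption, not a standard the analysis prescribes):

    import statistics

    def mttr_and_outliers(durations_minutes: list[float]) -> tuple[float, list[float]]:
        """Return the mean time to recovery and any durations that look like outliers."""
        mttr = statistics.mean(durations_minutes)
        q1, _, q3 = statistics.quantiles(durations_minutes, n=4)   # quartiles
        upper_fence = q3 + 1.5 * (q3 - q1)                         # simple IQR rule
        outliers = [d for d in durations_minutes if d > upper_fence]
        return mttr, outliers

    # Illustrative durations: mostly ~45-minute incidents plus two catastrophic ones
    durations = [30, 42, 45, 50, 38, 47, 44, 41, 35, 55, 240, 310]
    mttr, outliers = mttr_and_outliers(durations)
    print(f"MTTR: {mttr:.1f} min; outliers worth their own story: {outliers}")

The point is not the arithmetic but that the outliers are exactly the incidents whose themes deserve the spotlight when the MTTR number is presented.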

If we are looking at the number of incidents and their impact, we can take a similar approach. I am an advocate for analyzing incidents that were near-misses. This gives us a chance to see all the contributing factors and the themes around “what went well”. As you present your number of incidents, also present the number of near-misses (in my case, a near-miss is an incident that, thanks to the quick response of the team, avoided impacting our end users) and highlight your findings.

  • We can dive deep into the recurring themes from those near-misses and use them to help set the direction for future initiatives (for example, if some teams have invested more in their observability and that shows in their number of near-misses, we can recommend other teams follow suit). A short sketch of tallying such themes follows this list.
  • I also believe in highlighting any other items that we may be able to learn from, even if they’re not specifically related to the requested metrics. For example, following the move to remote work in 2020, many organizations have had to adapt the way they work and communicate. Understanding the themes that helped us be successful in that transition can drive future changes to our processes and technologies.
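Under the same assumption of simple Python tooling, a quick tally like the sketch below (with hypothetical theme names) is often enough to show which near-miss themes recur and are therefore worth recommending to other teams:

    from collections import Counter

    # Hypothetical themes captured while analyzing individual near-misses
    near_miss_themes = [
        ["strong observability", "fast rollback"],
        ["strong observability", "on-call escalation worked"],
        ["fast rollback"],
        ["strong observability"],
    ]

    # Count how often each theme recurs so the most common ones can anchor
    # recommendations (e.g. suggesting other teams invest in observability).
    theme_counts = Counter(theme for themes in near_miss_themes for theme in themes)
    for theme, count in theme_counts.most_common():
        print(f"{theme}: seen in {count} near-misses")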

All of the items mentioned above make it into the information I provide to those who may not be as intimately involved in incidents and incident analysis, such as engineering leaders, product owners, legal/compliance representatives, and executives. I believe incident metrics can be a powerful tool in learning from incidents, but it is important that they are presented with context. Incident metrics are best utilized as a way to set the direction for further analysis that can help us uncover the themes around how our systems work and how we respond to failures. As analysts, it is important to continuously improve upon our processes, question our quantitative and qualitative data, and present information and recommendations in a way that allows others to understand how to move forward.

As I experienced, a targeted but holistic focus on metrics not only helped me demonstrate the business case to others more clearly, but also helped my organization become more comfortable with the concept of learning from incidents.