Recently, I've been reflecting on my first year as an incident analyst and on which resources have been most helpful along the way. The first one that came to mind was a video by Dr. Johan Bergstrom called "Three analytical traps in accident investigation." Since almost day 1 of my incident analyst journey, this video has been indispensable to me.
The video has been so useful to me because it uses specific examples of what not to do, taken from an actual accident report: the United States National Transportation Safety Board's report on the Asiana Airlines Flight 214 crash in July 2013.
In this post I’ll be taking you through the traps that the video highlights, as well as examples of what I watch out for to avoid falling into these traps. I still watch this video every week so I can keep these traps at the front of my mind.
Avoid counterfactual reasoning.
As Bergstrom says in the video, counterfactual reasoning creates a parallel universe, which is not useful for incident analysis. Our job as incident analysts is to tell the story of the incident as it unfolded to those who took part in it, not what would have happened if they had done something else. As Bergstrom points out in the video, the Asiana Airlines pilots did what they did because that is what made sense to them at the time.
While studying my first incident with my team lead, David, I put something like this in our notes:
“If the team had just not turned off the power, then recovery steps would have been far easier.”
David kindly pointed out that I had fallen into the counterfactual trap and told me this video was a good resource. That moment has stayed with me to this day.
Normative language appears when the incident analyst injects their values into the study of an incident by measuring the response against some "norm", or idea of what is appropriate behavior. The problem with using this language is that it makes it sound like those involved in an incident chose to make a mistake. As Bergstrom says in the video:
“It is as if the pilots chose to fly a mismanaged profile.”
Sounds crazy, right? None of us wants to inject our values into the study of an incident. But it is an easy trap to fall into. I've caught myself writing things like:
“The SRE team made an appropriate decision when they decided to use a Grafana dashboard to display critical alerts.”
The problem, as the video says, is that this statement makes me sound like a judge rather than a curious analyst. It's not my job to judge whether an SRE team made a "good" or "bad" choice, but to understand why that choice was made.
Now I’m extra careful to check for normative language in the incident reports we prepare.
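One way to make that check a little more routine could be a small script that flags phrases worth a second look in a draft report. What follows is only a minimal sketch: the phrase lists and the `flag_phrases` function are hypothetical examples of my own, not something taken from the video, and a match only means "reread this sentence", not "this sentence is wrong".

```python
import re

# Hypothetical phrase lists for illustration only. They are assumptions,
# not an official or complete vocabulary of counterfactual or normative wording.
PATTERNS = {
    "counterfactual": [
        r"\bwould have\b",
        r"\bcould have\b",
        r"\bshould have\b",
        r"\bhad just\b",
    ],
    "normative": [
        r"\bappropriate\b",
        r"\binappropriate\b",
        r"\bmismanaged\b",
        r"\bcareless\b",
    ],
}


def flag_phrases(text):
    """Return (line_number, category, phrase) tuples for each match in text."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for category, patterns in PATTERNS.items():
            for pattern in patterns:
                for match in re.finditer(pattern, line, flags=re.IGNORECASE):
                    findings.append((line_no, category, match.group(0).lower()))
    return findings


if __name__ == "__main__":
    draft = (
        "If the team had just not turned off the power, "
        "then recovery steps would have been far easier.\n"
        "The SRE team made an appropriate decision."
    )
    for line_no, category, phrase in flag_phrases(draft):
        print(f"line {line_no}: possible {category} language: '{phrase}'")
```

A script like this cannot understand context, so it would miss plenty and flag harmless sentences too. At best it is a prompt to reread a sentence with the traps in mind.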
Mechanistic reasoning suggests that incidents are caused by a malfunctioning set of components in an otherwise perfectly working system. Therefore, if we eliminate all the components that were working, we must be able to find the problem, which is typically the person who "made a mistake".
The problem with this reasoning is that it goes against what we know to be true of Asiana Airlines Flight 214, or of any complex system. In the case of the Asiana Airlines crash, the flight systems the pilots used were working as they expected until they realized that the plane was in trouble, and by that time it was too late. Accidents, or incidents in the case of software, are often the result of well-functioning systems acting in ways that we do not expect them to.
A story comes to mind that helps illustrate this trap. Our incident analyst team was talking during our office hours when a new member of the team, unfamiliar with the learning from incidents approach, told us about an incident he was studying in which an engineer had forgotten to rotate a certificate.
Paraphrasing this new analyst:
“If this guy had just remembered to rotate the certificate then things would have been fine. I mean, it was written on his whiteboard for weeks!!”
Our response went something like:
“Well, how do you know that could have prevented the incident? What other pressures was this guy under? What other high priority things did he have on his plate?”
Our colleague didn't know it at the time, but he was engaging in mechanistic reasoning. For him, the problem boiled down to one thing: the engineer who never got around to rotating the certificate. In reality, there could have been so many other things at play. He has since thanked us for helping him to think more holistically about our systems during incidents.