Obstacles to learning from incidents
Learning from incidents: it sounds easy, right? We’re always learning...
Our team or organization has an incident, maybe we responded to it, and then we “learned” from it.
Unfortunately, just having the incident, or even participating in the response, doesn't mean that the kind of learning that will elevate and improve your organization (at least the way we talk about it on this site) has occurred. Further, even if you have personally learned something, that doesn't mean it’s been disseminated or that others have learned it as well.
I'm mostly operating from the view that we're talking about your incidents, however you slice that. They typically reside within your organization, though not always your immediate group or team. This isn't exclusive; we will talk briefly about incidents in other organizations as well.
The number one obstacle that can keep an organization from learning from an incident is distancing through differencing. This is when an us-versus-them mentality emerges and produces all sorts of reasons why, though this happened to them, it would of course never happen to us.
You can be on guard for phrases like:
- "We would never respond that way"
- "We have our own process to protect us from that"
- "We wouldn't do something [stupid] like that"
- "We're way more careful than they are"
- "Day (or whichever other) shift is always like that, we're not like that"
Phrases like these are strong signals that you're seeing distancing through differencing in action. It can happen at multiple levels: when hearing about an incident at another organization, or even when hearing about an incident at your own organization that occurred outside your group or team.
You can begin to counteract this just by knowing and being aware that such a phenomenon exists in our psychology. Of course, all incidents have different specifics, but very often there are patterns that we can recognize in our own environment.
Further, when examining an incident, we can purposefully look for similarities between that incident and our own context, including our expectations about how the system should work.
While having faith in our individual and group ability to face challenges can be a good indicator of an effective team, it is also possible to be overconfident in our ability to respond to or control a system hazard. We can become miscalibrated about the actual risk of an accident or incident to such a degree that we don't think about it or attempt to prepare for it. This, of course, can lead to surprise if an incident occurs involving that risk. It can also contribute to a sort of denial that encourages distancing through differencing. I call this out as a specific section, though, because treating it can help both further learning and potential future response.
That doesn't mean that we can't gain confidence in our abilities, but we also don't wish to become out of touch with the reality of risks. High reliability organizational research calls this "preoccupation with failure," which can help us learn and respond more effectively.
For example, during my time in EMS I learned to not only trust in my abilities, but also to recognize that accidents would occur that were well beyond them. There was no world where I would be able to treat everything I'd find.
You've likely seen processes where teams engage in a "root cause analysis" or similar, the goal being to find the one thing that started it all so it can be fixed. That way it won't ever happen again. This sounds nice, but unfortunately in complex systems, like the software systems we build and maintain, this doesn't apply. Multiple things need to occur for an incident to happen, and they’re important to understand beyond the precipitating event.
This might just sound like a matter of swapping out words, but it's actually a way of changing approaches. If we search for just one thing, we're making it almost impossible to find all the other things that contributed that we could possibly learn from. Starting with "root cause" means that we're going into it essentially with the bias of, "I'm going to blind myself to some contributing factors," which will inevitably constrain our learning.
Further, it can constrain others' learning as well. If it's communicated that there was a root cause to the incident, another person may presume that that is all there is: one fix to be made, problem done and dusted. Nothing to learn there, they may think.
Of course, our investigations can't continue forever; we must constrain our search in some way. But understanding how we choose to constrain our investigations, along with the constraints that naturally arise in any search for understanding, can help us view the results and lessons of an investigation in a more effective light.
Only trying to learn from "bad" things
Often, the learning attempts that do take place are strongly focused on the bad things, like what went wrong in a response or a bad outcome. But there is a lot to learn from incidents about what went well, what kept an outcome from being worse, and what helped marshal a productive response (see also: OOPS! Learning from the incident you never had).
Constraining our search for learning to only bad things means we'll miss all the good things. This isn't just a matter of being upbeat or positive; it can help us notice things like:
- What parts of our incident response plan do we want to practice since they were especially effective?
- What displays or system information helped bring clarity to the chaos?
- What groups or responders were especially prepared to respond to this incident? How did that happen?
- How do we uncover the unwritten know-how held by the daily practitioners making subtle adjustments to keep our systems in a stable state before things go wrong?
High pressure reporting requirements
This one might seem a bit surprising; after all, if incidents aren't reported, they can't be learned from. It's true that if an incident is swept under the rug it can be difficult to learn from. But the other extreme of the continuum, requiring investigations and reports to be produced in a tightly specified time period, can create pressure to "resolve" an investigation quickly.
This is likely to produce, at best, only a cursory investigation of the incident, making it incredibly difficult to learn from. In my experience, this sort of deadline approach is often a reaction to investigations not being done previously, or a fear that they wouldn't be done otherwise. As a result, a time period is mandated with the hope that that will solve the issue.
If broader change to this approach isn't yet possible in your organization or group, I've seen some limited success in changing what the deadline represents. That is, you can require an investigation to begin within a certain period (ideally broadly specified as a guideline) without firmly specifying when it has to be done, as that of course will vary depending on the incident.
Making sure this never happens again
As we touched on before, every incident is unique if you look at it in enough detail. It's also possible to get stuck in the details, especially if your goal is to prevent this specific incident, with this exact outcome, from ever happening again.
The detailed view of the individual incident itself should be the starting point from which to zoom out and see more and more general things about the system as a whole. If we remain in the very low-level detail of an individual incident, we are at risk of creating "fixes" that don't really address the problem, and we may see a similar-looking incident in the future despite our best efforts.
Looking at this through the lens of emergency medicine has helped me frame this way of thinking. It would be strange to imagine that a responder's purpose is to ensure a car accident never happens again. Instead, they can advance their skillset and influence systemic factors that can reduce the impact of an entire class of accidents. If they focused all of their efforts on making sure one exact type of car accident didn't happen (one in which someone swerved to miss a deer, on a wet road, while making a turn, etc.), they'd likely change a whole bunch of things that may make other accidents more common or impactful, while failing to advance the ability to respond to all those other accidents.
Confusing writing, distribution, or meetings with learning
Often, post-incident analysis produces artifacts, possibly some documents. There may even be some meetings during the process. The artifacts may be distributed throughout the company.
These may be helpful to the process of investigation, and the artifacts can even be helpful to review. But none of them inherently cause learning to occur.
It's important to realize that holding a meeting, creating a document, or sending a memo does not mean that learning has occurred. You can make it easier or more effective to learn from these processes or artifacts, or create a culture where learning is highly valued, but as an individual or a "safety" type group, you cannot cause learning to occur in others.
When I was in emergency medicine, I learned this firsthand. At one particular ambulance company I would often write reports as a result of my work, but the only people I knew of who read them were in the "quality" department. If I wrote a report in some way they didn't like, I'd end up in a meeting often called "chart review" or "report review."
This process has a lot of the characteristics of what software teams often do in hopes of learning. There would be an incident, I would write a report about it, perhaps a meeting would get scheduled, but it's pretty clear that no substantial learning occurred or was disseminated. That is, unless you count me learning how to write reports to suit a particular audience or learning how much I dislike those meetings.
If you've used this process before, you've likely seen similar results. People may become suspicious of the value of meetings and processes that look similar, and understandably so. In my experience, a way to help in this situation (and a good idea in general) is to focus on conversations. Talk to individuals about what they experienced. You'll almost certainly learn from it.
As a first step, you can stop there. As people get used to asking questions and having conversations, you can begin to write up some of it if you want a longer-lasting method of distribution. I'd caution, though, against expecting anywhere near the same level of information or insight from the writeup as from the conversations. Someone who reads it later may lack the context and background to understand how and why things were so confusing and hard to see in the moment.
This is ultimately where most of my learning from incidents came from in medicine and rescue settings: storytelling. Getting others to tell me their stories helped me begin to absorb some of their hard-won lessons.
As you may have gleaned by now, learning from incidents is hard work. It is not something that happens passively, and the processes by which it occurs and the hazards that can prevent it, just like our systems, are complex as well.
Fortunately, there are things we can do to help make substantial learning more likely, to create an environment where it is more likely to occur, both in ourselves and in our organizations. We can seek similarities to combat distancing through differencing. We can become preoccupied with failure to help prevent overconfidence. We can look at the good things along with the bad. We can consider multiple factors instead of seeking a single cause. We can reduce the pressure to produce a report by a certain time. Finally, we can change our view on what level of incident we focus on, seeking broader patterns and lessons.