Patience in Implementing Effective Incident Reviews
The pressure to just get an incident review "done"
As we struggle to understand how things go wrong, to learn from incidents, and to prepare ourselves for future surprises, the hurried rush to get answers and get them now, immediately after an incident, is prevalent. We have things to do, need to move on, and there are just so many other things to fix. We need to get to the bottom of an incident before it happens again.
To loosely quote a famous axiom, the best time to prepare for failure was before the incident; the second best time is right now.
As John Allspaw has pointed out previously, this urgency to complete an incident review goes up depending on how "big" the incident was. However, it's much harder to complete an effective incident review for a "big" incident. These arguably need more time to complete.
This is all perfectly reasonable, and even a natural, typical response. The urgency is right there, with higher stakes and ever-increasing complexity tightly coupled with the growing business needs of companies. Our problems aren't going away and the cost of failing to learn may be even worse next time. Uncertainty looms over us!
We may be facing a situation where we're the victims of our own success as well. At minimum, people can generally suss their way around the vague ideas behind retrospectives or incident analysis, efforts in which we're not looking to throw folks under the bus but to see problems from many angles. The spark of interest in learning is there!
If anything, we're far behind so many other industries that have been practicing Resilience Engineering for decades. Isn't it time for us in the software industry to catch up, to cover ground in large strides to make up for lost time?
In all this hurry to get things done, to move on to the next project, to dig wide and deep into "the good stuff", it's easy to forget that learning, the real, concrete, durable expertise needed in a crisis, doesn't happen overnight. For incident review facilitators, recognizing inflection points in a timeline and instinctively knowing how to ask just the right questions are muscles we build over time. It's tempting to pile up a stack of books, dive headfirst into it, and come out the other side with all the answers, itching to implement them.
Developing a mature outlook on incident investigation is more than having quotables on hand. It takes experience, and while that can only come from experimenting to see what works (and more often, what doesn't), there isn't a Stack Overflow for these studies.
Likewise, those of us at the "sharp end" tapping away furiously at our keyboards are classically trained to think fast and respond quickly. Every second delayed is money lost due to downtime, customers getting turned away, and colleagues awaiting answers. This is another reason Mean Time to Respond (MTTR) is such a poor indicator of an engineering team's ability to function. If I take two extra minutes of downtime to accurately assess, to build my confidence, and to coordinate with folks on the best path forward, could that be worth the additional time? MTTR in its most fatalistic form is pervasive in the never-ending pursuit of shrinking down a number. Resolving outages quickly is laudable, so long as the resolution is more than a band-aid.
Good work takes time.
Extending this just a bit further to the study of incident analysis as a practice, it's very tempting as engineers to solve problems ourselves. It's a lot of what we do in our day-to-day work: figure out what's on fire and put it out, or build supportive systems for future functionality that can hopefully weather the unexpected. Solving the problem ourselves is a very easy trap to fall into when first donning the role of incident facilitator. You have a room full of folks, physical or virtual, who are expecting you to lead the discussion. The inclination is then to gather the facts yourself and lay them out, to get to the heart of the issue so others can learn. This is tragically common but very wrong. As facilitators, our goal is to get others to speak and to share their expertise. Teasing out the perspectives of everyone involved directly and indirectly with an incident is the crucial step to surfacing those hidden truths. Listening should be the first thing in our minds as facilitators. Not speaking or even asking questions, but learning how to process what folks are saying. That takes a lot of time.
You might even say that sometimes resilience means not trying to figure out every corner of an incident. This may seem counterintuitive as well. We're discussing all the work we need to do, questions that need to be asked. We say that there's literally an endless breadth and depth to every incident. We can always go another layer deeper. It would seem to follow then that we can't spare a single moment. It's important to keep in mind, though, that all this effort, the constant expenditure of energy, takes its toll. It's not a sprint, it's a marathon, requiring a self-monitored pace to prepare us for the next one.
Now it may be easy to accept all of this and think, "Yeah yeah, I got it. Let me at that 'resilience'. I'm going to 'add so much resilience' to my system!" But these topics need time after they are introduced, slowly working their way down into the lower parts of our brains until they become deeper, nearly instinctive truths. It's one thing to hold a blame-aware retrospective. It's another to understand how the language surrounding it is key without simply resorting to vague or passive phrasing to mimic blame awareness ("Somebody typed in the SQL query that brought down the database"). A blog post, a conference talk, a strongly recommended book - these are fine things but just the starting points for this type of understanding.
We're building up connections that aren't immediately there, both throughout the field and within ourselves, maturing along the way. This work (implementing more effective incident reviews) has to be more than simply cramming for a test. It has to be built to last, to become innate in a way that allows us greater flexibility in tapping into it when we need to. This is an area of knowledge expanding around us, and mastering even a few small parts of it will take our entire careers. Yes, we have a ton of work ahead of us and we need to get to it. Just remember to take a breath every now and then.
Note from the editor: You'll notice the author used the term "Incident Review" instead of "Postmortem". Words have weight and meaning. The word "Postmortem" carries a lot of weight - it literally translates to "after death". If the name of a meeting or document resulting from an incident has a heavy, ominous weight associated with it, that can adversely impact how and whether people engage. Using the term "Incident Review" has been found to have a much more positive impact: it keeps the focus on learning and is used by many of the members of this community at their companies.