OOPS! Learning from the incident you didn't have

November 28, 2019
Operational Surprises
"An unintended but unavoidable consequence of associating safety with things that go wrong is a creeping lack of attention to things that go right."

- Erik Hollnagel, A Tale of Two Safeties.

In order to understand how things went wrong, we need to first understand how they went right

When an incident happens in an organization, the traditional response is to identify ways to prevent the incident from happening again in the future. The community around this website takes a different approach towards incident analysis. To paraphrase the late computer scientist Edsger Dijkstra, incident analysis is no more about incidents than astronomy is about telescopes. Instead of focusing on prevention, we seek to leverage incidents as an opportunity to learn as much as possible about how work is done within the organization.

It's possible to study how work is done in an organization without focusing on incidents. However, one of the advantages that incidents give us is that incident reviews are common throughout our industry, and so, as a consequence, people expect someone to go around asking questions after an incident. As incident investigators, we are just asking different sorts of questions and have different goals in mind than the traditional let's-make-sure-this-doesn't-happen-again approach.

Dave Hahn, from Netflix, reflects on only studying failures.

Aren't the incidents that make the news more important?

Ultimately, there doesn't need to be a business-impacting incident to study how work is done. While incidents with greater business impact get more attention from an organization than incidents with smaller impact, it doesn't necessarily follow that we can learn more from high-impact incidents than from lower-impact ones. In fact, it may be even harder to do an incident investigation for a high-profile incident, since there is more scrutiny from the organization. And the difference between a high-profile incident and no incident at all might be something as mundane as a particular individual being out of the office on a certain day.

And, if we can learn just as much from smaller incidents as we can from larger ones, we can also learn just as much from "incidents" where there is no impact at all! These are the kinds of things we call close calls or near misses.

Any time we encounter an operational surprise, something that happened in operations that we didn't expect, there's an opportunity for us to discover how the observed system behavior deviated from our mental model of how the system is supposed to behave.

There are a few nice things about operational surprises:

First, they don't carry with them the psychological impact of incidents. When somebody is involved in an incident, they can often suffer from "second victim syndrome", where they feel guilty about having contributed to the incident in some way. This doesn't happen with operational surprises, because there was no negative impact.

Second, the term "operational surprise" is much less ambiguous than "incident". As John Allspaw notes, incident severity is negotiable. On the other hand, I've never heard engineers argue over whether something was a surprise or not.

Third, they're likely happening all of the time inside of your organization!

At Netflix, we started the OOPS project to encourage engineering teams to self-report when they encounter an operational surprise. Each OOPS writeup contains a narrative description of the events that led to the surprise, and identifies contributors, mitigators, risks, and challenges in handling. This is the same writeup structure we use for incidents that we investigate.
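
For teams that want to try something similar, the writeup sections named above could be captured as a simple record. The following is only a minimal sketch: the class name `OopsWriteup`, its field names, and the `empty_sections` helper are all hypothetical and are not taken from Netflix's actual tooling, and the example content is invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class OopsWriteup:
    """Hypothetical record mirroring the writeup sections described above."""

    title: str
    narrative: str  # the events that led to the surprise, told in order
    contributors: List[str] = field(default_factory=list)
    mitigators: List[str] = field(default_factory=list)
    risks: List[str] = field(default_factory=list)
    challenges_in_handling: List[str] = field(default_factory=list)

    def empty_sections(self) -> List[str]:
        """Return the names of list sections that have no entries yet."""
        sections = {
            "contributors": self.contributors,
            "mitigators": self.mitigators,
            "risks": self.risks,
            "challenges_in_handling": self.challenges_in_handling,
        }
        return [name for name, items in sections.items() if not items]


# Invented example content, for illustration only.
writeup = OopsWriteup(
    title="Deploy finished later than expected",
    narrative="An engineer noticed a deploy was still running well after its usual duration.",
    contributors=["A dependency took longer than usual to respond"],
)
print(writeup.empty_sections())
```

A helper like `empty_sections` is one way to prompt the author to think about mitigators and risks, not only about what contributed to the surprise.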

Engineers share these write-ups within the organization. These are often commented upon and discussed, sometimes in a structured meeting where we focus on identifying risks and learnings.

By sharing experiences with OOPSies across the organization, we hope to build shared understanding around how the overall system behaves, demonstrate expertise in action, and encourage discussion around signals of risk that these OOPSies reveal.

Lorin Hochstein

I'm a Senior Software Engineer at Netflix on the Cloud Operations & Reliability Engineering (CORE) team.

Once upon a time, I studied software developers to understand what problems they face and to evaluate whether proposed technologies actually make their lives easier. Nowadays, I study operational surprises (some people call these incidents) in order to improve how Netflix engineers software.

I received a PhD in computer science from the University of Maryland (2006), an M.S. in electrical engineering from Boston University (2002), and a B.Eng. in computer engineering from McGill University (1999). I was an Assistant Professor at the University of Nebraska-Lincoln in the Computer Science & Engineering department from 2006-2008. I was a Computer Scientist at USC/ISI in the Adaptive Parallel EXecution (APEX) Computing Group from 2008-2011, a Lead Architect at Nimbis Services from 2012-2013, and a Lead Software Engineer at SendGrid Labs from 2014-2015.
