Introduction

Why build an online community around “learning from incidents”?

We're at an age in software where learning from incidents is pivotal to our companies' continued success. There is a massive opportunity for software engineers to learn more about the applications of Resilience Engineering, Human Factors, and Systems Safety to their everyday work, with the goal of learning how we can extract value and build expertise from incidents and surprises. I created this website and community to help longtime, new, and aspiring software engineers and leaders learn more about these fields and practice them within their organizations through conversation, research, and the sharing of ideas.

We, as an industry, tend to think we're currently doing a reasonable job with incidents and learning from them; this is evidenced in conference talks, blogs, and more that repeat the same general advice, advice that other safety-critical industries would consider bare bones (yes, software is safety-critical). However, I haven't often seen us look at notions like SRE and the other software safety best practices popular today with a critical lens. As a result, we are limiting how much we actually improve from incidents, near misses, and the like. The greatest issue organizations have with software safety remains not taking the time to understand how far behind the curve they may be.

The software industry seems bought into the notion of “blameless” approaches to incident reviews (introduced to the software industry by John Allspaw and Etsy, and later popularized in the Google SRE book). I put “blameless” in quotes because, instead of practicing it, I see most organizations end up doing a “blame dance” of sorts: we know we're not supposed to say whose fault it was, but we're still thinking it, and that thinking manifests itself in passive ways. Part of this is the result of not being self-critical of industry best practices for SRE and a lack of evolution in the field. We hear a lot of thought leadership about SRE best practices and moving the needle to five nines, but I have to wonder:

Are our organizations actually getting more reliable with these “SRE” mindsets, or are we just feeling better by papering over our problems with processes and nines?

-----

A story on introducing these concepts


I attempted to introduce incident analysis to an organization new to me in a similar way I had done in previous organizations: by starting a book club on Sidney Dekker’s The Field Guide to Understanding ‘Human Error’. This book is by far the best introduction to thinking about incidents from a psychological, organizational behavior, human factors, and Resilience Engineering perspective rather than a technical “error reduction” perspective. I posted a quick message about it and several people signed up -- and better yet, people were already sharing their excitement, skepticism, and critiques of the writing itself, all before the first meeting! One critique in particular came from a very senior engineer in the company -- someone whose name could be said aloud and all engineers would know it carried weight when it came to technical decisions. They wrote in the chat room something along the lines of:

“I have to say, I’m interested in discussing this...but I whole-heartedly believe in ‘human error’. I’ve made mistakes, they’ve had real consequences, ask my loved ones.”

There was a bit less chatter in the chat room (at least of the excited variety) after that statement. A very senior engineer had made a sweeping yet polarizing statement, one that the book we were about to discuss together argues strongly against, and one that reflects a way of thinking that can actually be harmful from an organizational perspective. I was excited to discuss this in person and eager to hear more of their takes on the first few chapters. Their reaction is a common one among folks who have skimmed the book, and it was bait I was going to take and draw from.

We can all say we have “made mistakes” that have impacted other people; that’s beside the point, especially from an organizational perspective. The point is to ask how it was even possible to “make that mistake” at all.

This reminds me of a separate story. An incident occurred at 3 AM (when all the bad incidents occur, right?). I was tasked with leading the investigation of this highly visible incident after the fact, but a senior engineering leader pulled me aside in the office the morning after and said something along the lines of:

“I don’t know if this incident is all that interesting for you to analyze.”
“Why?” I asked.
“Well, because it was all (whispers) ‘human error’. [Kiran] didn’t know what he was doing and wasn’t prepared to own the system. It could’ve waited until morning.”

This was in an organization that thought it was practicing “blamelessness” without a deep understanding of it. When something like this happens (a Kiran makes an “error”), it’s usually met with a new rule or process within the organization, without anyone saying publicly that they thought it was Kiran’s fault. This is still blame. It is not only unproductive, it actively hurts your organization’s ability to generate new insights and build expertise after incidents. Adding rules and procedures is also the easier path: it covers asses and allows us to “move on”, emotionally speaking. These new rules and procedures typically don’t come from folks on the front line, either. It’s easy to spot errors in hindsight; it’s much more difficult to encourage insights. Unfortunately, adding new rules and procedures actually diminishes the ability to glean new insights from these incidents.

Stating that an incident was the result of “human error” is usually a sign that the investigation has come to an end. When we see “human error” written about in the news after major accidents, it’s for a reason: it’s easier for us to emotionally cope with the idea that the accident was someone’s fault. It’s much harder to cope with the fact that the incident was the result of a complex system interacting in complex ways. But when we claim “human error” as a ‘cause’, regardless of whether we also share other ‘causes’, contributors, or enablers, we weaken the point of sharing those other contributors and draw attention away from them.

Bringing it all together

Ten months ago, I realized I was having many great conversations about Human Factors, Resilience Engineering, and Systems Safety, but they were spread across a wide array of communication channels. Most of them grew out of the master’s program I had enrolled in at Lund University. I wanted to get all these people in one place: we shared the idea that learning from and understanding incidents works best when we work together, and that there is so much more to be unpacked here than what software has done so far. How could we drive the industry forward if we weren’t coming together to talk about it? So I started a small community for people directly working in these areas.

A group of over 200 of us (though I don’t count this community’s success in numbers of members) has been learning and growing together for almost a year now. We are learning and teaching each other about organizational behavior, human factors, Resilience Engineering, and beyond -- implementing these learnings in our organizations, and then reporting back. We have had some of the most enlightening and thought-provoking conversations of my career. These lessons shouldn’t be kept to ourselves, so we decided it was time to start sharing some of them with the rest of the industry.

This hasn't all been me -- it is a passion project of mine, but the true continued success of this project comes from my fellow admins, who work every day with me to create a welcoming, safe space for people to collaborate and share:

John Allspaw
Richard Cook
Jessica DeVita
Will Gallego
Lorin Hochstein
Ryan Kitchens
Laura Maguire

On this website, Learning from Incidents in Software, you’ll hear real stories from these community members -- software professionals, professionals in safety-critical industries, researchers studying organizational behavior, product designers -- sharing how learning from incidents in their organizations has worked in practice, and how it hasn’t. Stories about learning and implementing real incident analysis. Stories about why it’s important. You’ll also find academic references rooted in practice; there have been countless studies showing how and why different approaches to learning from incidents matter. But most of all, the words you’ll see and hear in this place are from people “chopping the wood, and carrying the water”. Along with these stories, which you’ll read here and hear via a podcast coming soon, comes a call to action for the software industry. We have the opportunity to embed this thinking in our organizations before software safety gets committees and checklists that deem it “safe” or “unsafe”.

We are here to reshape how the software industry thinks about incidents, software reliability, and the critical role people play in keeping their systems running. If this community is successful, people will be doing and thinking about incident analysis in completely different ways than they were doing it before — as a valuable lens into not only where incidents come from, but what normally prevents them, what people do (and don’t) learn from them, and what makes incidents matter long after the dust has settled.

We’re here to challenge the conventional views and we’ll be bringing research to back us up.

We’ll have a new post every day this week with these stories, and after that, a new post weekly. You can sign up for the newsletter using the form at the bottom of this page.

Here are some of the article topics we have in the pipeline for you:

Incident analysis -- what it really is and what it could be
Using near misses to understand success
How one organization is implementing “new view” thinking
Feedback and success stories in software
Patience in incident reviews -- how long should this take?
Teaching the “smell” -- distilling expertise through incident analysis
Public vs. private incident reviews
And many more!


References:

Allspaw, John. 2012. Blameless Postmortems. http://codeascraft.com/2012/05/22/blameless-postmortems/

Harllee, Jessica. 2014. What Blameless Really Means. http://www.jessicaharllee.com/notes/what-blameless-really-means/

Klein, Gary. 2014. 'No, your organisation really won't innovate'. https://www.wired.co.uk/article/gary-klein

Kowalski, Kyle. On Enlightenment: 3 Meanings of the “Chop Wood, Carry Water” Zen Quote. https://www.sloww.co/enlightenment-chop-wood-carry-water

Shorrock, Steven. Life After Human Error. https://www.youtube.com/watch?v=STU3Or6ZU60

Snook, Scott. 2002. Friendly Fire.

Wears, Robert L., Sutcliffe, Kathleen M. 2019. Still Not Safe: Patient Safety and the Middle-Managing of American Medicine.