Incidents in a time of market uncertainty
Incidents are inherently related to market conditions, as well as what's going on in the world. When money gets tight, companies pinch pennies, and cash becomes king. This means that how your customers view you matters more than ever, and if they can't rely on you, it's imperative to think about how that became the case. Even if they're not directly canceling their licenses or subscriptions, that unreliability is impacting your bottom line in other ways (time, attrition, bloat in certain areas of the business).
Welcome to the first installment of the Learning from Incidents newsletter! You're receiving this because, at some point, you signed up on our website to receive updates from this community. I'm Nora Jones, the founder of the Learning from Incidents community.
Who is the LFI community?
We're made up of over 400 members from all over the tech industry: from the Fortune 100, to various startups, to companies that create applications you use and love every day. We all have one thing in common: we work on the incidents of these platforms, and we're the ones called when things go awry. We have stories to share and knowledge of ways our systems can be hardened.
However, we also have a lot more in common. We know that incidents are not improved by reporting on them more; they're improved when people learn from them. But how do you know when people are learning? What's the difference between individual learning, team learning, and organizational learning? That's what we're here to help you with, and frankly why we exist. We're experts in both the human and technical sides of incidents, and we all gather in a Slack community where we talk, learn, and share every day so that we can share back to you. With that said, here's what's on our minds this week:
We have big news...
A chance to gather with us in person: LFI Conference 2023
The LFI community, along with sponsors such as Jeli, Indeed, ACL, and others, is hosting the very first Learning from Incidents conference. The conference will feature speakers such as Lorin Hochstein, Charity Majors, Dr. David Woods, John Allspaw, me, and many others. It's an opportunity to get together, in person, with some of the most brilliant minds in incident management in the software industry.
CfP submissions are due on November 29th. This is open to everybody, not just members of the LFI community. If you are working in incidents, are impacted by them, or have stories to share about how you've learned from them, we want to hear from you.
How to get involved:
We're working on our community admission process (which currently requires being nominated by a member). However, we welcome folks from all over the tech industry (and outside of it!) to join the LFI conference. It will be a great opportunity to meet the folks who have been participating and innovating in this community for the past 4 years.
This community is run completely for free, by the good nature of the admins. The conference is being put on by my company, Jeli.io, but we are opening up, for the first time, the opportunity to sponsor this event. We greatly appreciate the support so that we can continue to make this a great event for years to come. Email email@example.com if you'd like to participate or are interested in connecting about the opportunity.
DevGuild Incident Response 2022:
Last week, community members Sarah Butt (Salesforce), Dylan Bochman (Spotify), Erin McKewon (Zendesk), and I sat down for a roundtable on Incident Response. The day before, community member Liz Fong-Jones (Honeycomb) participated in a panel on observability as it relates to incident response.
"When I talk to people that say 'we read this book and now we want to do SRE' or 'we heard this talk and now we want to do SRE,' the conversation I always encourage people to have is to explore what SRE would mean to your organization outside of all the buzzwords, and exactly why you need it." - Sarah Butt, Salesforce
A peek at what's happening inside the LFI community this week:
What we're reading:
A small group in the LFI community has been reading "Still Not Safe" by the late Robert L. Wears on the patient safety movement - a group of physicians in the 90s seeking to raise awareness of the tens of thousands of deaths in the US attributed to medical errors each year. Wears comes from a medical background and wrote a paper I share with almost everyone I encounter in the software industry, "The Error of Counting 'errors'", which in just 2 pages illustrates the downsides of counting your errors and the complications it can bring to your organization. While we are not medicine, the "errors" we count impact the people in our organizations, which in turn impacts the product we ultimately deliver to the end user. Note: these complications don't come the next day -- they accumulate over years and can take the form of failed projects, missed deadlines, attrition, and over-hiring.
I find the "Still Not Safe" book to be an active warning to the software industry not to fall into the trap of just measuring numbers and errors simply because it's easier. We need to make it psychologically safe for our organizational systems to also be scrutinized after incidents. This involves leadership being willing to look at hard truths, and it involves workers being comfortable hearing about the sharp end of management (the context they might not always be privy to).
What we're listening to:
I just finished listening to a 2-part podcast called Labor Management Relations (from the series called Talking about Organizations Podcast). They do a fascinating deep dive into Tom Lupton's book, "On the Shop Floor: Two Studies of Workshop Organization and Output", and into how the systems set up in our organizations deeply impact how people work and perform.
In the second case, many more examples are given of the social punishment for working too fast vs too slow, and the issue behind this was the incentive scheme. This ends up creating a giant tension between what the workers think "right" looks like, and what management thinks "right" looks like. Sound familiar?
I've been thinking a lot about these types of comparative studies lately -- such as what can happen to an organization when it does a shallow incident review vs a thorough incident review. I plan to talk about this with Dr. Laura Maguire in our talk at SRECon APAC on December 7.
What we're watching:
There is a relatively new series on Netflix about Three Mile Island (April 2022), which actually is the event that spawned the Human Factors and System Safety movement, and informed a lot of today's thinking about human practitioner cognition. Check out Richard Cook's prescient talk on the subject. If you're watching the Netflix series, this will help add a lot of context to what you're seeing.
We are also, along with the rest of Ops, SREs, etc., watching the Twitter saga unfold. While we may see some folks celebrating Musk for cutting staff while the website stays up, a website doesn't fail overnight. Most engineers in the industry know that a major part of the last decade of hardening and building distributed internet systems, so that the fail whale isn't standard, was taught to us, in part, by Twitter engineers. A blog post on developer efficiency personally changed my career back in 2015, and introduced me to the brilliant work of Caitie McCaffrey and several other Twitter engineers at the time. This is all to say that the engineers who have been at Twitter throughout the years set up the platform to not fail overnight, and they are the reason it is still functioning today.
That's all for this week, see you next Sunday!