We love incidents.

Well, love is a strong word. But incidents don't have to be a terrible experience. The community we started has spent a lot of time understanding just how much value you can get out of incidents, and how they can be used to spread expertise throughout the organization. We are here to reshape how the software industry thinks about incidents, software reliability, and the critical role people play in keeping their systems running. If this community is successful, people will be doing and thinking about incident analysis in completely different ways than before: as a valuable lens into not only where incidents come from, but what normally prevents them, what people do (and don’t) learn from them, and what makes incidents matter long after the dust has settled. We’re here to challenge the conventional views, and we’ll be bringing research to back us up. A couple hundred of us have been working together since February 2019 to bring this thinking into the industry. This is where we write about our stories and what we have learned from studying the field of incident analysis and what it can bring to the software industry. Learn more in our introduction post.

Meet the admins

Nora Jones
Nora founded the LFI community and website as a way to show organizations how to get more ROI out of their most powerful investments -- their incidents. You may have seen her keynote at re:Invent in 2017 on Chaos Engineering and how it can help organizations level up their understanding of their systems. Since then, her view on these topics has grown and she's become increasingly interested in the fields of Human Factors and Systems Safety as they pertain to the software world.
John Allspaw
John is an engineering leader and researcher with over 20 years of experience in building and leading teams engaged in software and systems engineering. He has spent the last decade bridging insights from Human Factors, Cognitive Systems Engineering, and Resilience Engineering to the domain of software engineering and operations.
Richard Cook
Dr. Richard Cook is a research scientist, physician, and pioneer in Resilience Engineering for safety in complex risk-critical worlds, and author of the seminal paper “How Complex Systems Fail” as well as Behind Human Error (2010). Richard is presently a research scientist in the Department of Integrated Systems Engineering at the Ohio State University and emeritus professor of healthcare systems safety at Sweden’s KTH.
Jessica DeVita
Jessica has spent 20+ years in IT Operations roles across companies such as Netflix, Microsoft, Chef, and St. Jude Medical. She is currently at Microsoft, bringing human factors perspectives to her work leading the AKS SRE team and partnering with AKS product engineering teams. Her thesis research for the Human Factors and Systems Safety master's program at Lund University is about software and configuration deployment decision-making.
Will Gallego
Will is a Software Engineer working on the web. He likes to learn as much as he can about anything he can and always works to help others. Will is a fan of collaborative approaches to software engineering, after-incident reviews (postmortems/retros), human-centered engineering, and distributed systems. Oh, and gif is pronounced with a soft G.
Vanessa Huerta Granda
Vanessa is a Solutions Engineer at Jeli. She has previously led Resilience Engineering and SRE teams, focusing on production incident processes, learning from incidents, and leading on-call rotations of Incident Commanders.
Lorin Hochstein
Lorin is a Senior Software Engineer at Netflix on the Cloud Operations & Reliability Engineering (CORE) team. Once upon a time, he studied software developers to understand what problems they face and to evaluate whether proposed technologies actually make their lives easier. Nowadays, he studies operational surprises (some people call these incidents) in order to improve how Netflix engineers software.
Laura Maguire
Dr. Maguire is a researcher producing human-centered design guidance at Jeli.io. Her doctoral work studied distributed incident response practices in DevOps teams responsible for critical digital services. She was a researcher with the SNAFU Catchers Consortium from 2017 to 2020, and her research interests lie in Resilience Engineering, coordination design, and enabling adaptive capacity across distributed work teams.
Ryan Kitchens
Ryan is a senior engineer on the CORE team at Netflix, where he works on building capacity across the organization to ensure its availability. Before that, Ryan was a founding member of the SRE team at Blizzard Entertainment.