The Changing Climate of Resilience in Software

December 10, 2019

Resilience Engineering, often shortened to RE, is heating up.

Practitioners are clamoring for "what's next" in the field of incident management and in wrangling the complex systems that our society increasingly relies on in big, small, and surprising ways. Tracks on RE and related topics at well-established industry conferences, community watering holes (like this one), and even an industry conference devoted entirely to RE all point to a growing curiosity about Resilience Engineering's concepts within the software development and operations field.

Despite this interest, full-on adoption of Resilience Engineering's principles in the practice of developing and operating large software systems remains relatively nascent. This reality presents RE proponents with an interesting question: if there's demonstrable interest in exploring how the cultivation of Resilience can benefit day-to-day work and materially improve outcomes, then why is that adoption still fledgling? In a recent chat about this dichotomy, an RE colleague mused that Resilience Engineering adoption felt not unlike the challenges facing us in another domain: global climate change.

The claim may, at first blush, sound a bit sensational. But addressing climate change in any coherent, sustainable way shares many of the same challenges we face when arguing that cultivating Resilient practices offers our industry a tool to deal with some of its thorniest challenges:

  • Winning Friends and Influencing People. Many people's initial interactions with the Resilience Engineering community can be, for lack of a better word, off-putting.

    We rail against root cause. (And the mere utterance of those words!) We tell you to stop (stop, stop, stop, and stop) using Five Whys. We criticize the way many hold their incident retrospectives ("You use it to come up with remediation items? How quaint..."). We nag about blame, even when established, reasonable cultural norms have clearly been violated. In short, our initial conversations with many developers, operations engineers, and managers can appear to be solely a treatise on what they're doing wrong with their lives... at least if they want to live a Resilient one.

    This is not an unfamiliar challenge to climate change activists: their early messaging also often came across as "Everything you're doing in your life is wrong." Drive a car? You're wrecking the planet. Eat a hamburger? You're choking the atmosphere with cow belches. Use a plastic... anything? You monster.

    This is an important dynamic to recognize precisely because, while the science may support both those climate activists and us RE proponents, the entirety of our initial message can feel like a pretty visceral, context-free attack on how others live their lives or conduct their careers. And while we may not be wrong, those initial impressions matter... and ours can be hard to confront without feeling at least a little bit insulted or frustrated.
  • Demonstrating the Dividends. Many of the investments Resilience Engineering asks organizations, teams, and individuals to make are expensive, certainly in terms of expended time, but often monetarily too. Of course, set in the context of helping to prevent multi-million dollar outages, the investment is a sound one... over the long term. But in the context of day-to-day operations or lines on a headcount budget spreadsheet, the soundness of that investment isn’t always immediately obvious.

    Similarly, climate change requires we make some pretty fundamental changes to the way we run societies and economies globally. Opponents of the idea that climate change is harmful to humanity continue to (compellingly!) make the argument that the investment in renewable energy sources, de-carbonizing the economy, and other green technologies just is not worth the cost... at least when we're not faced, year after year, with storms that cause multi-billion-dollar damage.

    Fundamentally, both addressing climate change and implementing Resilience Engineering require being open to investing in the future... and for those already underwater (operationally, at least), it can be a hard sell to pitch solutions with a heavy up-front cost... even if we know it can pay off (and even if the payoff is “merely” future disaster mitigation).
  • The Root Cause of the Storm. Relatedly, climate scientists are often asked if specific heatwaves, firestorms, hurricanes, or floods are "due to climate change." Their answer is always the same: "Climate change cannot be linked as the root cause of a single weather event. (Also, the weather is not climate!)"

    And similarly, no honest Resilience Engineering practitioner would assert that the root cause of an incident was a failure to employ RE practices. (And, in the same vein, no one should believe any RE practitioner who claims an incident was avoided due to RE work.)

    In a business and management climate that relies on linear explanations for events and clearly established cause-and-effect relationships, the inability to demonstrate these conclusively (at least while maintaining one's professional integrity) presents a real hurdle to explaining Resilience Engineering, to say nothing of practicing it... just as it does to climate scientists begging us to change.
  • The Science is Solid. While investments toward a carbon-neutral future and practicing Resilience in the day-to-day of our work can carry a cost and the benefits appear nebulous and difficult to describe, there is plenty of science illustrating the value of both these efforts.

    Interestingly, safety scientists observing the effects of Resilience Engineering as its practices are deployed in organizations are still surprised at some of the effects... and we haven't mapped out all of the mechanisms yet. This, too, is not unlike the study of the hyper-complex system that is our weather and global climate. But as both sciences continue to collect data about definable outcomes (be they split-second, in-situ improvements in incident handling, concrete updates to teams' below-the-line understandings of their system, or the patterns of hurricanes), we add to the body of knowledge showing that there are observable effects. And whether they're beneficial (as in the case of Resilience Engineering) or a serious wake-up call (in the case of climate science), the criticality of the call to action is clear.
  • Once Seen, Forever Changed. In my years of working on Resilience Engineering efforts with various organizations, one observation continues to be evergreen: once a developer, operations engineer, or manager, via some of the Resilience Engineering techniques we practice together, gets a sense of the true complexity of their socio-technical systems, has a hidden second story resonate with them personally, or wrestles their system back from the wrong side of the Boundary of Acceptable Risk, narrowly avoiding an incident, they cannot unsee it. And it is usually at this point that they begin to understand the true value of Resilience Engineering and organizational learning.

    So it is with climate change: whether it's hearing about native peoples of Alaska being forced to relocate or watching walruses unknowingly plunge to their deaths, once we see the aggregated effects of a warming world, most of us can't unsee them.

For both Resilience Engineering and global climate change, there exists a class of experiences that solidify adherents and bring them an intrinsic understanding that is incredibly compelling... but those experiences take time to be set in a context that will resonate with us all. And we, as RE proponents (or climate change activists), must be patient as we join others on their journey toward these experiences.

Solving climate change has been described as a "wicked problem": “a problem whose social complexity means that it has no determinable stopping point.” Given such a definition, it might be the case that deploying sustainable Resilience practices in our organizations and disseminating the behavior throughout our industry represents a challenge that is also "wicked" in nature.

Either way, it is incumbent upon us, the Resilience Engineering community, to demonstrate that our novel, Rube Goldbergian view of our industry's socio-technical systems, while accurate, is more than just novel and amusing to look at. It's on us to not only demonstrate but also cultivate the value proposition for RE... and we must do so in a way that doesn't immediately feel like a challenge or attack to our colleagues’ established practices and beliefs.

The clock is ticking.


J. Paul Reed

J. Paul Reed began his career in the trenches as a build/release and operations engineer. After launching a successful consulting firm, he now spends his days as a Senior Applied Resilience Engineer on Netflix's CORE team, focusing on incident analysis, systemic risk identification and mitigation, Resilience Engineering, and human factors expressed in the streaming leader's various socio-technical systems.

Reed is an internationally recognized speaker on operational socio-technical complexity challenges and opportunities, Resilience Engineering, and DevOps, and holds a Master of Science in Human Factors & Systems Safety from Lund University.
