Setting the scene
You’re an engineer on a team with an on-call rotation - it’s been your gig for a few years. It’s Friday afternoon, and chatter in work chat has started to slow down as people head out to start the weekend.
Just as you are leaving, someone from the support team pops up in the Operations team’s on-call channel saying “hey...I just got a report from a couple of customers that they cannot change passwords on their accounts on our site”. Within seconds, someone from the frontend team chimes in “hey...is our website ok?”
If you have been in the business of managing software as a service long enough, this is already giving you the well known tingles of “I know this...I have been through this”. I call this ‘the smell’. Half traumatic, half familiar, definitely not a good feeling but known all the same.
No outage has been declared yet. Conversation in the chat channel is spinning in circles. There is no incident manager ‘to declare it’ even though we can all tell something is not right and are acting accordingly.
Debates during incidents
Teams can find themselves debating ‘what is an outage?’, ‘who can declare an outage?’, ‘what level does it need to be declared for?’, ‘who do we need to tell about this?’. They can easily spend a long time on this discussion.
They’ll maybe even continue the discussion with:
“Is this a good thing to spend the team’s time on?”
“How should this go, ideally - do we have a runbook?”
These are existential questions that most companies grapple with during incidents at some point in their lifetime. While some of these are healthy signs of company growth and an expanding customer base, they are still not easy questions, and they can increase coordination costs.
We can likely all agree on some common needs within incidents:
- Declare an outage when customers are actually impacted
- Know when customers are impacted before they have to contact us and let us know
It may seem like these two needs are contradictory, but they don’t need to be. These goals become attainable if we have the right signals alerting engineers and an engineering team whose mental model of their system is constantly adapting to what they encounter. Keeping that mental model adapting takes work from the company; it doesn’t come for free.
There are always easy to understand, “low hanging fruit” failure cases that are undeniable. Maybe at your company it’s when a database primary has failed and we have not switched to a new host yet. As a result, some writes are now failing and customers very likely cannot use a subset of the site functionality.
But what about more “obscure” things that folks don’t have expertise in - a Redis cluster used for caching has failed, or a pub sub system has slowed down to 10% of its normal throughput? In these cases, the context of the architecture becomes even more important. Because people built these systems, it quickly becomes clear that the architecture we intended to build and the architecture's designed behavior are not always the same thing. The mental model we all create of the systems we build needs to include the people operating these systems and the information we present to them while they operate. The decisions they make can change how incidents evolve, so discounting the process by which these operators come to decisions is akin to intentionally building systems with partial data.
People as part of the system
When the first inklings of noise come up in chat (people unsure if there is customer impact, how big it is, and whether an outage needs to be declared), it is sometimes a smaller subset of the team that seems to have the first hunch about what is wrong. Why? While incidents do not repeat, they do rhyme in how they tend to break. It is the heuristic experience of team members who have been around longer that tends to lead the way to finding that smell and following it to what is actually happening. Recognizing humans as part of the system is an important piece of how a team conducts incident reviews later with the goal of learning how to be resilient.
If the decisions we make during an incident impact how the incident progresses, then ignoring those decision making processes and what information they are based on ‘in the moment’ is detrimental to how well the team will learn.
While some incident reviews may prioritize other goals above learning, one would hope that learning is still pursued with some measure of due diligence.
So what makes some more capable of detecting trouble sooner than the rest?
And -- how do we transfer that skill more reliably to the rest of the team members and make it a reliable part of new members’ on boarding?
- Most importantly, encourage curiosity. Our industry loves the new shiny thing. There is no shortage of ‘curiosity’ and ‘hunger’ when using the new tech or building the new feature. However, curiosity about the existing and sprawling systems has immense value as well. That random spike of 5xx responses that ‘self resolved’ should invoke questions. That random report of increased latency could very well be a canary in the outage coal mine. The difference is “do we follow the leads and spend some cycles in the team to discuss findings?” or “do we breathe a sigh of relief we didn’t hit outage thresholds and move on to the new thing we are building?”
- Everyone should be able to declare an outage, but also create a path to communicate “no customer impact” after the fact. Any rule that designates some role for ‘who can declare there is an incident’ constrains both the team’s ability to learn more deeply about the system and its ability to constantly align their mental model of the services with what is actually happening. Whether an unexpected event meets ‘customer facing incident’ thresholds or not, it should still be treated as a learning opportunity for the team to align their mental models with the system behaviour. Separating customer communications from the actual learning from unexpected events will increase the organization’s capacity to learn.
- Runbooks must be written by the team(s) associated with a service. Runbooks that are consistently revisited after incidents can go a long way to improving our understanding of what is happening in the code. When a runbook explicitly states to call one person in particular to fix a service, it is an opportunity to write a better runbook, disseminating the localized expertise. The organization needs to recognize human single points of failure as both a burnout risk for that person and a business liability for the team when that person is absent/unreachable.
- Runbooks are never thoroughly exhaustive. Psychological safety in the team is paramount for the continuous improvement of runbooks, design documents, and the team’s understanding of the system. Outages are stressful enough. Team members need to feel safe in those times to ask questions, probe, and compare what the documentation says is supposed to happen vs what is really happening.
- Incident reviews should not be exclusive to ‘customer facing incidents’. Often I see teams becoming less constrained in their learning when the ‘incident’ was actually a near miss (‘we got lucky’). These also deserve learning reviews as opportunities to improve a team's understanding of their services and the weak points within.
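One way to make the first point above concrete is to give the spikes that ‘self resolved’ a place to land instead of letting them disappear. The Python below is only a minimal sketch under stated assumptions: the `Window` shape, both threshold values, and the function names are hypothetical and chosen for illustration, not taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical thresholds, chosen only for illustration.
OUTAGE_5XX_RATE = 0.05     # at or above this rate, declare an incident
CURIOSITY_5XX_RATE = 0.01  # at or above this rate, worth a team discussion


@dataclass
class Window:
    """One minute of traffic for a service (hypothetical shape)."""
    minute: str
    requests: int
    errors_5xx: int


def classify(window: Window) -> str:
    """Label a window as 'ok', 'near_miss', or 'outage' by its 5xx rate."""
    if window.requests == 0:
        return "ok"
    rate = window.errors_5xx / window.requests
    if rate >= OUTAGE_5XX_RATE:
        return "outage"
    if rate >= CURIOSITY_5XX_RATE:
        return "near_miss"
    return "ok"


def review_queue(windows: List[Window]) -> List[Window]:
    """Collect windows that stayed under the outage threshold but still look odd."""
    return [w for w in windows if classify(w) == "near_miss"]


if __name__ == "__main__":
    traffic = [
        Window("14:00", 1000, 2),
        Window("14:01", 1000, 30),  # a spike that 'self resolved'
        Window("14:02", 1000, 3),
    ]
    for w in review_queue(traffic):
        print(f"{w.minute}: {w.errors_5xx}/{w.requests} 5xx, add to next review")
```

The numbers matter less than the second threshold existing at all: it turns “we didn’t hit outage thresholds” from a reason to move on into a prompt for a learning review.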
Incidents and near incidents tend to be discovered first through pattern matching and familiarity by those who have been working in ‘the system’ for a long time. Teams can either use these incidents to teach everyone or they can become a reason for further reliance on human single points of failure and siloed knowledge. It's up to the organization to decide which path to take.
Klein, et al. 2016. “Can We Trust Best Practices? Six Cognitive Challenges of Evidence-Based Approaches”
This paper helps us understand the limits of the role runbooks can play in handling incidents successfully.