Most companies have some rules or guidelines in place for whenever an “incident” happens, be it an outage or something else that impacts customers. Some companies call the resulting documents incident reports, others root cause analyses. Sadly, too often, producing these reports is seen as a chore, essentially punishment for standing somewhere close to a failing system. People end up writing reports that point just enough fingers in other directions to get away with it. “It’s bad enough that we had to work until 3am to fix it, now we have to write a report about it too?”
While this attitude towards incident analysis is very understandable at a human level, we believe it’s the wrong way to look at it. At OLX we focus heavily on learning. And by far the best opportunities to learn are right after the shit hits the proverbial fan.
At our company incidents happen, as they do everywhere. We have people on call who can generally resolve issues quickly. Then, the next day, after the dust has settled, everyone involved is tasked with answering four questions:
- What happened?
- Why did it happen?
- What can we learn from it?
- What can we do to prevent such issues in the future?
To answer these questions we have used various techniques over time. Recently, we’ve been moving towards Toyota’s “Five Whys” technique. Not all our packs are applying it yet, but we’re already seeing a lot of value.
The goals of a Five Whys session:
- Identify the root cause of an incident. At first sight this root cause may seem obvious to everybody, but more often than not, a well-conducted five whys session leads to surprising results.
- Learn from mistakes. Incidents happen, and they should be treated as a learning moment.
- Share the learning by sharing the report widely.
Notably, the goal of this session is not the report. Even in the hypothetical case where nobody reads it, the session should still have produced valuable action items and deeper insight into issues in the tech, the product, and sometimes even the organizational culture.
The goal of the session is also not to point fingers. Yes, somewhere along the line people made mistakes, but this is bound to happen — our goal is to create an environment, a system, that minimizes the impact of such human mistakes.
Here is probably my favorite quote on this topic, one that perfectly captures the goal of the Five Whys:
“Let’s plan for a future where we’re all as stupid as we are today.” — Dan Milstein
Before we get to the recipe for conducting a five whys (which is, in a sense, shockingly unsurprising) let’s start with some basic rules.
- The meeting happens after every major incident, no exceptions. Especially if the incident has happened before (or even occurs regularly). “We don’t need a five whys, we know the root cause” is not a reason to skip.
- All people involved are present in one room (real or virtual) discussing the issue synchronously.
- The meeting happens within a few days of the incident. It can happen the same day, but there’s value in having it after the heat of the moment — allowing some time for reflection.
- Action items must be SMART — specific, measurable, attainable, relevant, and time-based. Also, they should be assigned to a specific person (not a team).
- Action items must be prioritized in proportion to the severity of the incident. For instance, if the incident was that the color of a button changed accidentally, prioritizing end-to-end tests that verify the color of DOM elements is probably not proportional.
The (not so secret) recipe
Here’s the basic formula:
- Start with the problem statement. Write it down on a whiteboard or Google Doc.
- Ask the “why?” question. No productive “why?” question to ask? Really?
- Answer the “why” question.
While traditionally the line of asking “why?” is linear, we have seen good results from asking multiple different “why?” questions at each level.
So not plainly “why?” but various more specific whys, for instance:
- Why did this occur? (root cause)
- Why did we only find out when people complained about it on twitter? (monitoring)
- Why did it take us 5 hours to find the root cause? (mean-time-to-recovery)
- Why did this problem have the impact that it had? (impact minimization)
This approach tends to result in “why forests” — but that’s ok. It’s the job of the session facilitator to figure out which “branches” can lead to productive results.
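To make the “why forest” idea concrete, here is a minimal sketch of how a facilitator might record the branches during a session. It is an illustration only, not a tool we use: the `WhyNode` class, its fields, and the sample incident, answer and follow-up question are all made up for this example, while the four first-level questions are the ones listed above.

```python
from dataclasses import dataclass, field


@dataclass
class WhyNode:
    """A node in a why tree: the problem statement at the root, a "why?" below it."""
    text: str
    answer: str = ""
    theme: str = ""  # hypothetical tag, e.g. "monitoring"
    children: list["WhyNode"] = field(default_factory=list)

    def ask(self, text, answer="", theme=""):
        """Attach a follow-up "why?" to this node and return it."""
        child = WhyNode(text, answer, theme)
        self.children.append(child)
        return child

    def why_depth(self):
        """Length of the longest chain of whys below this node."""
        return max((1 + c.why_depth() for c in self.children), default=0)

    def leaves(self):
        """Nodes without a follow-up yet: candidates to dig deeper or to stop."""
        if not self.children:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]


# Made-up incident, used only to show the shape of the tree.
incident = WhyNode("Checkout page returned errors")
cause = incident.ask("Why did this occur?", "A bad config was deployed", "root cause")
cause.ask("Why was the config deployed without review?")  # made-up follow-up
incident.ask("Why did we only find out when people complained about it on twitter?",
             theme="monitoring")
incident.ask("Why did it take us 5 hours to find the root cause?",
             theme="mean-time-to-recovery")
incident.ask("Why did this problem have the impact that it had?",
             theme="impact minimization")

print("deepest chain of whys:", incident.why_depth())
for leaf in incident.leaves():
    print("open branch:", leaf.text)
```

Listing the open branches next to the depth of the deepest chain gives the facilitator a quick view of which branches have barely been explored and which have already been followed several levels down.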
You may wonder: what’s up with the “five” in “five whys”? It originates from the general heuristic that asking “why?” five times should get you to the root cause. If you ask it only once or twice, you’re probably just scratching the surface of an issue. Beyond that you tend to get into deeper cultural, organizational or systems-level problems, and this is where the big wins can be made.
If you have a feeling that “stuff keeps breaking over and over” in your organization, have a serious look at the Five Whys. Find somebody internally who’s already familiar with them or passionate to learn, or hire a trained facilitator. Then, start building that improvement muscle.
Mistakes will always be made. The best thing you can do is to make those mistakes productive.