Blameless has been speaking to on-call engineers...
We've found that incidents take longer than you think. You might look back at an incident and remember how long it took to find and implement a solution, but there's much more to it than that. We broke down every part of incident management, using the experiences of your on-call peers, to see how much time you really spend.
Spending too much time on incidents can cause problems in a variety of areas, including...
Delays planned work
Every minute you spend fixing an incident is a minute that could have been spent on new feature work.
Engineers burn out
No engineer wants to spend all day fighting fires. Less time on incidents means happier engineers.
Reputation takes a hit
If you don't handle incidents quickly and thoroughly, customers will worry and look to competitors.
WHAT IS AN INCIDENT?
Incidents are more than just bugs or outages. They’re anything that pulls you away from your planned work, and they last until you’ve returned to your planned task.
Responding to the Alert
When you get the alert that something is wrong, it takes you some time to process and act on the information.
Wrapping up Previous Work
You probably want to jump straight into solving an incident you’ve been alerted to, but things aren’t so simple. Every interruption comes with some time spent wrapping up the previous task.
Classifying the Incident
Before you know how to triage and respond, you need to judge the severity of the incident and the areas it affects. This will inform who you invite to resolve it.
Gathering Responders
Few incidents can be resolved totally solo. You need to look up on-call schedules and figure out service owners and subject matter experts, then ping them to gather in a central place.
Coordinating Work
Once you have your dream team prepared, you need to deploy them efficiently. Making sure every task has been covered without redundancy isn’t trivial.
Collecting Information
Understanding what’s going wrong requires gathering information from your monitoring tools. You need to know how your system is functioning now compared with how it behaved in a healthy state.
Diagnosing the Problem
Now that you have some context, you need to zoom in on the specific problem through testing. Analyze your codebase and architecture to pinpoint where things are breaking.
Communicating Incident Status
While you're busy working, many stakeholders – from customers to executives – will want to know what’s happening. Time needs to be allocated to inform them.
Devising a Solution
This is the biggest chunk of the resolution process, and the one that can vary the most in time spent. Sometimes you’re up all night, and sometimes the fix comes to you in an instant. An hour is an optimistic estimate based on conversations with your peers.
Implementing the Solution
This stage can also vary wildly in time, from changing a single line of code, to a tedious database migration, to a total architectural overhaul. We’ll remain optimistic with a half-hour estimate.
Testing the Solution
As tempting as it is, deploying the first idea you come up with can lead to an even worse disaster. Spending time on quick QA is vital.
Deploying the Solution
Depending on your architecture and deployment process, this can take a long time or very little. Our estimate assumes a fairly automated and mature process.
Verifying that the Problem was Fixed
Now that the fix is in users’ hands, you need to make sure it actually… fixed it. Running more tests aligned with your initial diagnosis makes sure you’re in the clear.
Now that the solution has been deployed and the problem has been solved, you’re done, right?
WRONG!
You’re just getting started.
Summarizing the Resolution
You need to create a retrospective document as a resource for future incident responders. Step one is collecting and summarizing what went wrong, what was tried, what happened, and when.
Judging the Impact
The next step for your retrospective is figuring out how big an impact the incident actually had. Look at how many users were affected, how badly they were affected, and for how long.
Tracking the Impact
Now that you understand the impact the incident had, see how it changes your tracked metrics such as overall uptime.
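As a hedged illustration of tracking that change, here's a minimal Python sketch that converts an incident's downtime into an availability percentage. The 30-day window and the 45-minute outage are hypothetical values chosen for the example, not figures from the breakdown above, and `availability_percent` is an illustrative helper name.

```python
def availability_percent(window_minutes: float, downtime_minutes: float) -> float:
    """Return uptime as a percentage of the tracking window."""
    if window_minutes <= 0:
        raise ValueError("window_minutes must be positive")
    if not 0 <= downtime_minutes <= window_minutes:
        raise ValueError("downtime_minutes must be between 0 and window_minutes")
    return 100 * (window_minutes - downtime_minutes) / window_minutes


# Assumption: uptime is tracked over a rolling 30-day window.
WINDOW_MINUTES = 30 * 24 * 60

# Assumption: the incident caused 45 minutes of downtime in that window.
before = availability_percent(WINDOW_MINUTES, 0)
after = availability_percent(WINDOW_MINUTES, 45)

print(f"Uptime before the incident: {before:.3f}%")
print(f"Uptime after the incident:  {after:.3f}%")
```

Swap in your own window and downtime figures to see how much of your uptime target a single incident uses up.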
Analyzing the Causes
This is the most substantial part of your retrospective. Work together to think holistically about every factor that contributed to the incident occurring. Dig deep, and think about the causes of causes.
Devising Systemic Changes
When you’ve identified the causes of the incident, do what you can to prevent them from recurring. Find what systemic changes can be made to prevent those scenarios. They can be codebase changes, new policies, additional resources, and more.
Implementing Systemic Changes
Actually implementing the changes you’ve prioritized could take minutes, hours, or weeks, and will likely involve other teams. But merely tracking and allocating these tasks is time-consuming enough.
Communicating the Retrospective
Many stakeholders will need to know what went wrong and how it was fixed. Take time to make sure they receive the retrospective document and answer questions about it.
Refocusing on Your Original Work
You’re finally done with incident-related work. Hooray! However, the incident isn’t really over until you’re back making progress on your original task. Don’t discount the time it takes to refocus on where you were before.
So, what’s our Grand Total?
Thinking holistically, we find that each incident causes a delay of 475 minutes, or almost 8 hours. That’s a whole working day spent resolving an incident!
475 Minutes
There probably haven’t been many times when you consciously spent a whole day on an incident. If there have been, we feel for you. Often these tasks are distributed over multiple days, or handled by multiple people. But that time is still being spent, and planned work is being delayed.
If a typical engineering salary is about $150,000 per year, each incident will cost you about $660 in engineering time alone. If engineers are dealing with an incident a week, this adds up to $34,320 per engineer per year in engineering costs. Check out our Return on Investment calculator to see more about the costs of incidents.
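To see how figures like these fit together, here's a rough sketch of the arithmetic in Python. The salary, the 475 minutes, and the one-incident-a-week rate come from the text above. The 1,800 working hours per year is an assumption on our part (the text doesn't state one), chosen because it reproduces the $660 figure. With 2,080 hours (52 weeks of 40 hours), the per-incident cost would come out closer to $571.

```python
ANNUAL_SALARY = 150_000        # dollars per year, from the text
MINUTES_PER_INCIDENT = 475     # grand total, from the text
INCIDENTS_PER_YEAR = 52        # one incident a week, from the text
WORKING_HOURS_PER_YEAR = 1_800 # ASSUMPTION: not stated in the text


def cost_per_incident(salary: float, minutes: float, hours_per_year: float) -> float:
    """Engineering cost of one incident, in dollars."""
    hourly_rate = salary / hours_per_year
    return hourly_rate * minutes / 60


per_incident = round(
    cost_per_incident(ANNUAL_SALARY, MINUTES_PER_INCIDENT, WORKING_HOURS_PER_YEAR)
)
per_engineer_per_year = per_incident * INCIDENTS_PER_YEAR

print(f"Hours per incident: {MINUTES_PER_INCIDENT / 60:.1f}")
print(f"Cost per incident: ${per_incident:,}")
print(f"Cost per engineer per year: ${per_engineer_per_year:,}")
```

Change the salary, incident rate, or working hours to match your own team and see how quickly the yearly cost moves.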
The good news is that Blameless can reduce many of these times. We make many incident tasks automatic, letting you focus on resolving the problem fast with our role-based guidance. After you're back online, prevent repeat incidents and strengthen your system with our suite of features for incident learning.
Find out more by signing up for a demo!