Alert Fatigue in SRE: What It Is & How To Avoid It
Wondering about alert fatigue? We describe what it is, how it affects software development teams, and how to avoid it.
What is alert fatigue?
Alert fatigue is the phenomenon of employees becoming desensitized to alert messages because of the overwhelming volume of alerts they receive and the high proportion of false positives among them. The risk of alert fatigue is that important information will be overlooked or ignored.
Recognizing alert fatigue
Setting up an alert system to notify people when something breaks in your system is a necessary part of your reliability solution. You want your monitoring tools to detect any anomaly in your system, and your alerting tools to respond appropriately as soon as detection happens. Your top priority is to avoid missing any incidents, as an incident that goes unnoticed can cause continuous damage to your system and customers.
This priority of not missing incidents can lead to an alerting system that overreacts, which in turn causes alert fatigue, and that can be even more damaging. Alert fatigue leads to employee burnout and inefficient responses. In the worst cases, it leads to many missed or ignored alerts, creating the very problem the alerting system was meant to solve!
How do you tell if your alert system is overreacting and your employees are suffering from alert fatigue? Let’s take a look at some common alert fatigue symptoms:
Engineers are alerted when they aren’t needed. When something goes wrong, it’s tempting to get all hands on deck. However, the only people who should be involved are those with the expertise to help with the response, or with responsibilities affected by the incident that require their direct involvement in resolution. If your system is bringing in people to do nothing, they’ll start to expect that an alert might not require anything from them and could begin ignoring alerts.
Small incidents get major responses. If engineers are woken up in the middle of the night, brought in on huge team responses, or told to prioritize a response over any other tasks, they’d reasonably expect that the incident must be severe – something causing immediate customer pain. If instead they’re brought in urgently to deal with minor troubles, like an unpopular service running slow, they’ll become desensitized to alerts and not respond quickly when major incidents do occur.
Engineers are burnt out. Being alerted is stressful. Each time an engineer’s pager goes off, there’s some cost to their focus or their rest. You have to weigh the cost to the engineer’s ability to do good work with the benefit of alerting them. As stress accumulates, engineers can become burnt out – totally unable to do good work, and very likely to leave the organization. Burnt out employees lower morale and slow progress. If you’re noticing stressed out and unproductive engineers, alert fatigue could be an issue.
Solving alert fatigue
Getting alert fatigue under control can be a challenge. Finding the line where an alerting system doesn’t over-alert people but also doesn’t miss any incidents can be tough. However, it’s worth investing in solving this problem. These techniques not only address alert fatigue, they also make your system more robust and informative.
1. Make a robust classification system
All incidents aren’t created equal, and they shouldn’t have the same responses. We discussed how important it is to get only the right people involved when something goes wrong – it’s much easier to avoid fatigue when every alert you get is actually relevant to you. Classification is how you determine the severity and service area of an incident and alert accordingly.
Building a classification system is a collaborative and iterative process. Each service area’s development and operations teams should have input on marking which incidents they have expertise in and ownership over. They’ll also be the experts on how to recognize severe and minor incidents in their service area. You won’t build a classification system perfectly the first time – after each incident, review how it was classified and who was alerted, and refine the system to make sure the right team was brought on board.
There can be disputes between teams as to how to judge severity. The solution is to use customer happiness as a universal metric for the whole organization. Using SLIs and SLOs, you can build a metric that reflects if customers are happy with any particular experience using your services. When an incident occurs, you can judge how much it disrupts that experience, and use that as the basis of your severity. If people know what an alert means in terms of user happiness and business value, they’ll feel less overwhelmed and fatigued.
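To make this concrete, here’s a minimal sketch of what such a classification rule could look like in code. It’s illustrative only: the service names, budget thresholds, and the `Incident` fields are hypothetical stand-ins for data you’d pull from your own monitoring and SLO tooling.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    """Hypothetical incident record fed in by your monitoring pipeline."""
    service: str                     # service area, e.g. "checkout"
    error_budget_burned_pct: float   # share of the SLO error budget this incident burns

# Hypothetical ownership map: which team is paged for which service area.
SERVICE_OWNERS = {
    "checkout": "payments-team",
    "search": "discovery-team",
}

def classify(incident: Incident) -> tuple[str, str]:
    """Return (severity, owning_team), judging severity by customer impact.

    The thresholds are arbitrary examples; the point is that severity maps to
    how much the incident threatens the SLO, i.e. customer happiness.
    """
    owner = SERVICE_OWNERS.get(incident.service, "on-call-catchall")
    if incident.error_budget_burned_pct >= 10:
        return "SEV1", owner   # page immediately, major customer impact
    if incident.error_budget_burned_pct >= 2:
        return "SEV2", owner   # notify the owning team, moderate impact
    return "SEV3", owner       # ticket only, no page

# A minor search slowdown that burns 1% of budget stays a SEV3 ticket.
print(classify(Incident(service="search", error_budget_burned_pct=1.0)))
```

The design point is that severity and who gets paged fall out of the same classification step, so no one is woken up for a service they don’t own or for an incident that doesn’t threaten customer happiness.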
Clear service ownership will also help:
- Prevent code from being “thrown over the wall” and encourage learning. When developers are responsible for owning the services they build, it encourages better practices around shipping performant code. It’s easy to overlook potential flaws in your code when you’re not the one supporting the service, and focusing on reliability over shipping new features doesn’t seem as important when you’re not the one being paged over the weekend. But when service ownership is encouraged, developers will begin scrutinizing their code more thoroughly. Additionally, action items and learnings from incidents are more likely to be fed back into the software development lifecycle (SDLC) when developers are in the loop on what issues are occurring.
- Keep teams under pressure from burning out by sharing the load. If your traditional ops team is up every night, working 70-hour work weeks and spending every weekend with their laptop, it should come as no surprise that eventually productivity will falter. People need breaks and time away from work. Without that time to themselves, teams under pressure will be susceptible to burnout. Management will be stuck in a cycle of hiring and training for roles they filled just a few months ago. Service ownership helps spread the on-call responsibilities out so that everyone has a turn carrying the pager. This can also have the unexpected benefit of familiarizing on-call engineers with their product a little more, as they’ll have to triage it during incidents. Service ownership helps balance on-call to keep engineers practiced and prepared.
- Create the same incentive for everyone, limiting silos. Service ownership can also help ease the tension between innovation and reliability, and will encourage even heavily siloed organizations to talk between teams in order to better prioritize feature work and reliability work. Everyone wants to move fast, sleep well, and have systems strong enough that they aren’t alerted about an issue every time they check their phone.
2. Create runbooks for your alerts
What happens when you receive a notification that something is wrong with your system and you have no clue what it means, or why you’re receiving that alert? Maybe you have to parse through the alert conditions to suss out what the alert indicates, or maybe you need to ping a coworker and ask. Not knowing what to do with an alert also contributes to alert fatigue, because it increases the toil and time required to respond.
To resolve this, make sure you create runbooks for the alert conditions you set. These runbooks should explain why you received the alert and what it is monitoring. Runbooks should also contain the following information:
- Map of your system architecture: You’ll need to understand how each service functions and connects. This helps your on-call team have better visibility into the dependencies that may ultimately be triggering an alert.
- Service owners: This gives you someone to contact in the event that the alert is still not making sense, or the incident requires a technical expert for the service affected.
- Key procedures and checklist tasks: Checklists can give on-call engineers a place to start when looking into an alert. This helps preserve cognitive capacity for resolving the actual issue behind the alert.
- Identify methods to bake into automation: Does this alert actually require human intervention? If not, add scripts that handle the alert automatically and notify you only if the automation cannot fix the issue for you (a minimal sketch of this pattern follows this list).
- Continue refining, learning, and improving: Runbooks are next to worthless if they aren't up to date. When you revisit these to make updates, take the opportunity to learn from them again, looking for new opportunities to automate and optimize.
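The automation point above deserves a sketch. The handler below illustrates the “automate first, page only on failure” pattern; `restart_service` and `page_on_call` are hypothetical placeholders for whatever remediation and paging integrations your team actually uses.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("auto-remediation")

def restart_service(service: str) -> bool:
    """Hypothetical remediation step; swap in your real automation."""
    log.info("Restarting %s ...", service)
    return True  # pretend the restart fixed the issue

def page_on_call(message: str) -> None:
    """Hypothetical paging hook; swap in your real alerting integration."""
    log.warning("PAGING ON-CALL: %s", message)

def handle_alert(service: str, alert_name: str) -> None:
    """Try automated remediation first; involve a human only if it fails."""
    if restart_service(service):
        log.info("%s on %s auto-remediated; no page sent.", alert_name, service)
    else:
        page_on_call(f"{alert_name} on {service}: automation failed, human needed.")

handle_alert("checkout", "HighErrorRate")
```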
Runbooks can be a huge help to on-call engineers, but they need to be made easily accessible to the right people on the front lines. In addition to making sure your team has all the information they need to deal with alerts while on-call, it’s also important to take a deeper look at the on-call schedule you maintain.
3. Add nuance to your on-call schedule
On-call engineers are those who are available to respond to incidents the moment they happen. Generally, engineers take rotating on-call shifts. This allows some responders to be available 24/7 while also giving everyone periods of total rest. Making an on-call schedule that is fair, keeps people from burning out, and responds to incidents effectively can be challenging, but it’s necessary.
When building an on-call schedule, your first instinct may be to give everyone an equal amount of time for their shifts. However, not all shifts are the same. Incidents can often correlate with periods where services are used more frequently, or when updates are pushed. By tracking patterns in incidents, you can judge when severe incidents occur most frequently. You can also judge what types of incidents are most difficult to resolve, which can be different from those that are most severe.
By having this more nuanced understanding of incidents, you can judge the overall “difficulty” of on-call shifts. This likely corresponds to how much alert fatigue will accumulate from working that shift. Balancing on-call schedules with more nuance will greatly reduce the alert fatigue of any given engineer. Of course, you won’t get the perfect balance right away. Reviewing and adjusting on-call shifts continuously is necessary to keep everyone at their best. The most important thing is to communicate and empathize with on-call engineers to make sure their needs are being met.
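As one illustration of how you might quantify that “difficulty,” the sketch below scores shift buckets from past incident data. The weights and the shape of the incident records are assumptions, not a prescribed formula; the useful habit is scoring shifts from real history rather than treating them as interchangeable.

```python
from collections import defaultdict

# Hypothetical incident history: (shift bucket, severity weight, hours to resolve).
# In practice this would come from your incident management tooling.
PAST_INCIDENTS = [
    ("fri-night", 3, 4.0),   # severe and slow to resolve
    ("fri-night", 2, 1.5),
    ("tue-day",   1, 0.5),   # minor, quick fix
    ("sun-day",   2, 2.0),
]

def shift_difficulty(incidents):
    """Score each shift bucket by the severity and resolution time of past incidents."""
    scores = defaultdict(float)
    for bucket, severity_weight, hours_to_resolve in incidents:
        # Assumed weighting: severity counts for more than raw resolution time.
        scores[bucket] += 2 * severity_weight + hours_to_resolve
    return dict(scores)

# Use the scores to balance rotations, e.g. pair one heavy shift with lighter ones.
print(shift_difficulty(PAST_INCIDENTS))
# {'fri-night': 15.5, 'tue-day': 2.5, 'sun-day': 6.0}
```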
4. Set SLOs to create guidelines for alerts
Sometimes the problem isn’t that you have too many incidents; it could be that you’re alerting on the wrong things, or have set the wrong alerting thresholds. To minimize alert fatigue, it’s important to distinguish what is worth alerting on and what isn’t. One way to do this is with SLOs.
SLOs are internal thresholds that help teams guard customer satisfaction. These thresholds are set based on SLIs: singular metrics captured from the service’s monitorable data. SLIs measure the points on a user journey that matter most to customers, such as latency, availability, throughput, or freshness of data at certain junctions. These metrics (stated as good events / valid events over a period of time) indicate what your customers care most about. SLOs are the objectives you must meet to keep them happy.
Imagine that your service is an online shopping platform. Your customers care most about availability. In this case, you’ve determined that to keep customers happy, your service requires 99.9% availability. That means you can only have 43.83 minutes of downtime per month before customer happiness will be affected.
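For reference, that 43.83-minute figure is just the allowed failure fraction multiplied by the length of the window, here an average month of roughly 30.44 days:

```python
# Error budget for a 99.9% availability SLO over an average month (~30.44 days).
slo = 0.999
minutes_per_month = 30.44 * 24 * 60              # ~43,833.6 minutes in the window
error_budget_minutes = (1 - slo) * minutes_per_month
print(round(error_budget_minutes, 2))            # ~43.83 minutes of allowed downtime
```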
Based on this, you have wiggle room for planned maintenance or shipping new features that might risk a potential blip. But how much wiggle room are you comfortable with? An error budget policy is an agreement that all stakeholders make on what will be done in the event an SLO is threatened.
So imagine that out of your 43.83 minutes, you’ve used 21. Do you need minute-by-minute alerts? Not likely. Instead, you’ll want to set up alerting thresholds that let you know when you’ve reached certain milestones such as 25%, 50%, 75%, and so on. You might even automate these alerts so that, depending on when you hit these thresholds within your monthly rolling window, you aren’t alerted at all. For example, you might not care that 75% of your error budget has been used if there are only 2 days left in the window.
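As an illustrative sketch of that milestone-style alerting, the function below pages only when a new budget milestone is crossed and enough of the rolling window remains for the page to be actionable. The milestone values and the two-day cutoff are arbitrary example choices, not recommended settings.

```python
MILESTONES = (0.25, 0.50, 0.75, 0.90)  # fractions of the error budget consumed

def milestone_alert(budget_used_fraction, days_left_in_window, already_alerted):
    """Return the next milestone worth alerting on, or None.

    Skips alerting near the end of the rolling window (example cutoff: 2 days),
    since late-window budget burn is usually not actionable.
    """
    if days_left_in_window <= 2:
        return None
    for milestone in MILESTONES:
        if budget_used_fraction >= milestone and milestone not in already_alerted:
            return milestone
    return None

# 21 of 43.83 minutes used (~48%), mid-window: the 25% milestone fires first.
print(milestone_alert(21 / 43.83, days_left_in_window=14, already_alerted=set()))  # 0.25
```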
By setting SLOs and creating thresholds that apply to them, you can minimize the unnecessary alerts you receive, allowing you to pay attention to the ones you really need to know about.
5. Squash repeat bugs
The only thing more annoying than being paged is being paged for the same thing over and over again. After a while, an engineer will likely begin to ignore alerts for repeat bugs, especially if they’re not customer-impacting. They become desensitized, possibly overlooking the issue until it grows larger and starts affecting customers.
To avoid this, make sure that repeat issues are addressed or that alerting for them is turned off. Look through your incident retrospectives, notice which issues crop up again and again, and get alignment between product and engineering on the severity of these issues.
Does it need to be prioritized in the next sprint to cut down on repeat incidents? If so, taking care of the issue as soon as possible can save your on-call engineers a lot of unnecessary stress and frustration. If the bug isn’t important enough to prioritize anytime soon, then consider turning off alerting on it, or alerting only when it’s tied to customer impact.
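If your retrospectives live in a structured store, even a quick pass like the illustrative one below can surface repeat offenders worth this fix-or-mute decision. The record format, the fingerprint field, and the threshold of three occurrences are assumptions made for the example.

```python
from collections import Counter

# Hypothetical retrospective records tagged with a root-cause fingerprint.
RETROSPECTIVES = [
    {"fingerprint": "stale-cache-node", "customer_impact": False},
    {"fingerprint": "stale-cache-node", "customer_impact": False},
    {"fingerprint": "db-failover-slow", "customer_impact": True},
    {"fingerprint": "stale-cache-node", "customer_impact": False},
]

def repeat_offenders(retros, min_occurrences=3):
    """Return fingerprints that keep recurring and deserve a fix-or-mute decision."""
    counts = Counter(r["fingerprint"] for r in retros)
    return [fingerprint for fingerprint, n in counts.items() if n >= min_occurrences]

print(repeat_offenders(RETROSPECTIVES))  # ['stale-cache-node']
```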
6. Improve the reliability of your system
It seems like it goes without saying – of course, if you have fewer incidents, you’ll have fewer alerts, and less alert fatigue! But it’s worth considering proactively improving reliability and reducing incidents, instead of just reactively alerting better. Improving reliability is a complex and multifaceted process. However, in terms of reducing alerts, the main factor to consider is how often your system produces incidents that require an alert and response.
Some amount of failure is inevitable; some incidents will occur. An important goal should be to prevent incidents you’ve already dealt with before – not making the same mistake twice. There’s a lot more fatigue in getting alerted for the same thing going wrong again. To prevent repeat incidents, use tools like incident retrospectives. These documents help you find the causes of incidents and drive follow-up changes that stop those causes from recurring.
Tooling to reduce alert fatigue
The right tooling makes implementing these changes much easier and amplifies their impact on alert fatigue. A sophisticated alerting process can’t be toilsome, or it will create more work than it saves. Blameless can help. Our SLOs can show you the true severity of incidents, and our retrospectives let you learn from incidents and prevent recurrence. To see how, check out a demo.