Best Practices for Creating On-Call Rotations and Schedules
As users expect incidents and outages to be addressed as quickly as possible, at any time of day, on-call rotations have become necessary for SRE and DevOps teams. How do you create on-call rotation schedules that are fair and reduce burnout?
What is an on-call rotation?
An on-call rotation is a schedule that rotates through a group of engineers so that there is always someone available to immediately respond to incidents, either by fixing the problem or escalating the issue to teams who can fix it. Employees are assigned shifts throughout the entire day and night to monitor the system and handle emergencies that negatively impact users. An on-call rotation is crucial for business continuity and overall customer experience.
An on-call rotation serves many vital purposes for the business, and limiting downtime is one of the most important. Downtime costs businesses an estimated $700 billion in North America alone, across organizations of all sizes.
However, if on-call rotations aren’t designed well, they can be a source of burnout and fatigue for staff members. While being on-call is a necessary part of the job, businesses and teams can work together to create an experience that is fair and doesn’t add to team stress levels.
What do on-call engineers do?
Most on-call rotations are delegated to sysadmins or operations engineers. Teams following a DevOps culture and site reliability engineering (SRE) framework may also have site reliability engineers on-call.
As part of SRE culture, SREs and on-call engineers are primarily responsible for monitoring and solving issues. During an on-call shift, they watch for problems and respond to any alerts that fire. On-call responsibilities can include fixing broken code, handling server issues, and resolving other problems that could significantly impact the business and users.
How is an on-call schedule created? Some on-call schedule examples:
Once you’ve settled on the need for on-call team members, there are a variety of ways to create an on-call schedule and plenty of on-call rotation ideas to borrow from.
For smaller teams
For teams of 5 or fewer, it’s often tricky to create an on-call schedule that doesn’t lead to burnout or alert fatigue, because each engineer has to take long shifts frequently to cover the entire time span. Generally, the best practice is to create an alternating-days schedule or a weekly rotation.
Where possible, a small team can arrange for third-party monitoring to act as an on-call backup. In addition, teams with three or more members can create a rotating backup schedule to ensure workloads stay balanced.
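As a rough illustration, a weekly rotation with a rotating backup can be generated in a few lines of Python. This is a minimal sketch, not a prescribed method: the function name `weekly_rotation`, the engineer names, and the start date are all hypothetical, and it assumes the backup is simply the next person in line.

```python
from datetime import date, timedelta

def weekly_rotation(engineers, start, weeks):
    """Return one (week_start, primary, backup) tuple per week."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for a primary and a backup")
    schedule = []
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        # The backup is the next person in line, so this week's backup
        # becomes next week's primary.
        backup = engineers[(week + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=week), primary, backup))
    return schedule

# Hypothetical three-person team starting on a Monday.
team = ["ana", "ben", "chi"]
schedule = weekly_rotation(team, date(2024, 1, 1), 4)
for week_start, primary, backup in schedule:
    print(week_start.isoformat(), "primary:", primary, "backup:", backup)
```

Because this week's backup is next week's primary, each engineer gets a lighter week of context before carrying the pager.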
Location-based scheduling, i.e. follow the sun
A key example of an on-call rotation schedule is the “follow the sun” model. You can create a distributed work schedule based on locations if your team is large enough to span several time zones. That way, team members can handle on-call monitoring when it’s not the middle of the night for them, and the on-call rotation stays balanced.
Even if the team isn’t large enough to support that scheduling, it’s worth looking for additional support from different time zones to handle some or all night shifts.
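To make the idea concrete, here is a minimal sketch of follow-the-sun routing in Python. The region names and the UTC coverage windows are illustrative assumptions, not recommendations; a real team would set them from where its engineers actually work.

```python
from datetime import datetime, timezone

# Hypothetical regions with the UTC hours [start, end) each one covers.
# The three windows together cover all 24 hours with no overlap.
COVERAGE = [
    ("apac", 0, 8),
    ("emea", 8, 16),
    ("amer", 16, 24),
]

def region_on_call(moment):
    """Return the region whose coverage window contains the given time.

    `moment` should be a timezone-aware datetime; it is converted to UTC.
    """
    hour = moment.astimezone(timezone.utc).hour
    for region, start, end in COVERAGE:
        if start <= hour < end:
            return region
    raise ValueError("coverage windows leave a gap at hour %d" % hour)

print(region_on_call(datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)))
```

A smaller team could use the same structure with only two windows, handing off just the overnight hours to a partner in another time zone.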
Responsibility-based scheduling
When creating on-call schedules, it’s essential to consider who is responsible for what when it comes to service delivery. Ownership can significantly affect how much the on-call engineer can accomplish, as they may not have all the tools and knowledge needed for each service.
You can split on-call duties across different teams and services to keep it as disruption-free as possible. You can also automate as much as possible, using tools like automated runbooks, to reduce on-call workload and create standardized processes for issues.
Frequency-based scheduling
How often the team rotates will vary based on team size and workloads, but a few options are available. For example, you could opt for semimonthly schedules, where team members rotate based on the responsibilities outlined.
You could also do a week/weekend schedule where some team members cycle through during the week, others through the weekend, and then switch. You can also create ad-hoc scheduling and rotation based on needs, but that can often be difficult to manage and not feasible long-term.
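One way to picture the week/weekend split is a small lookup that answers "which group covers this date?". The sketch below is only an assumption about how such a split might work: the two groups swap roles every week, and the `epoch` date, the group contents, and the function name `on_call_group` are hypothetical.

```python
from datetime import date

def on_call_group(day, group_a, group_b, epoch=date(2024, 1, 1)):
    """Return the group covering `day` in a week/weekend split that swaps weekly.

    `epoch` must be a Monday so that weeks are counted Monday to Sunday.
    """
    week_index = (day - epoch).days // 7
    is_weekend = day.weekday() >= 5  # Saturday is 5, Sunday is 6
    # In even weeks group A takes weekdays and group B the weekend;
    # in odd weeks the two groups swap.
    a_has_weekdays = week_index % 2 == 0
    if is_weekend:
        return group_b if a_has_weekdays else group_a
    return group_a if a_has_weekdays else group_b

weekday_first = ["ana", "ben"]
weekend_first = ["chi", "dev"]
print(on_call_group(date(2024, 1, 6), weekday_first, weekend_first))
```

Within the returned group, members can still cycle through individual days using whatever rotation the team prefers.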
What are on-call best practices?
There are many ways to create an on-call schedule, depending on team size and needs. However, if you aren’t sure how to set up an on-call rotation, there are some best practices that can help you come up with one tailored to your needs.
Communicate early and often
Before taking on any models, it’s crucial to first work together with the team to understand what they want.
Forcing people to go on-call without letting them contribute to the schedule will hurt employee well-being and productivity in the long run. Involving them keeps the process transparent and allows everyone to provide feedback. For example, employee preferences could influence scheduling if some prefer to work later hours than others. Or they can suggest rotation schedules based on workloads and expectations, such as one week on/one week off.
What’s alert fatigue?
Alert fatigue occurs when engineers are overwhelmed by too many alerts and can’t sort through them to properly triage and respond. To prevent it, identify what kinds of alerts on-call team members need to receive versus what can be automated. Teams must decide what types of incidents need an alert in the middle of the night and what can wait until the morning.
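Those decisions can be written down as an explicit routing rule so nobody has to make them at 3 a.m. The following Python sketch is one possible shape for such a rule. The severity levels, the overnight window of 22:00 to 07:00, and the action names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class Alert:
    name: str
    severity: str                  # "critical", "warning", or "info" (hypothetical levels)
    auto_remediable: bool = False  # True if an automated fix exists

def route(alert, hour):
    """Decide what happens to an alert raised at the given local hour (0-23)."""
    if alert.auto_remediable:
        return "automate"
    if alert.severity == "critical":
        return "page"  # critical issues page someone at any hour
    overnight = hour >= 22 or hour < 7
    # Everything else waits until morning if it fires overnight.
    return "queue_until_morning" if overnight else "notify"

print(route(Alert("checkout errors", "critical"), 3))
print(route(Alert("slow batch job", "warning"), 3))
```

Keeping the rule in one place also makes it easy to review with the team when an alert turns out to be noisier than expected.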
Establish on-call responsibilities
You don’t want team members to feel adrift while on-call, especially when it’s odd hours and they can’t just ping team members with questions. So it’s essential to set up expectations and responsibilities from the start and save documentation so team members can refer back to it.
For example, does being on-call mean responding to alerts as they come up? Or does it involve some kind of active monitoring? And what should team members do if it’s a problem they can’t immediately solve? Having runbooks and documented protocols reduces the midnight panic and enables everyone to access the information needed to resolve incidents quickly and effectively.
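One piece of that documentation is the escalation path for problems the on-call engineer cannot resolve or acknowledge in time. As a hedged sketch, it can be expressed as a simple chain of contacts and wait times. The roles and the minute values below are hypothetical placeholders, not recommended thresholds.

```python
# Hypothetical escalation chain: (who to contact, minutes to wait for an acknowledgement).
ESCALATION_CHAIN = [
    ("primary on-call", 15),
    ("backup on-call", 15),
    ("engineering manager", 30),
]

def who_to_contact(minutes_unacknowledged):
    """Return the contact responsible after an alert has gone unacknowledged this long."""
    elapsed = 0
    for contact, wait in ESCALATION_CHAIN:
        elapsed += wait
        if minutes_unacknowledged < elapsed:
            return contact
    # Past the end of the chain, stay with the last contact.
    return ESCALATION_CHAIN[-1][0]

print(who_to_contact(20))
```

Writing the chain down as data makes it easy to publish alongside the runbooks so everyone knows who is next in line.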
Avoid the night shift where possible
Night shifts are demanding and draining and can negatively impact employee health and well-being. Unless employees specifically ask for overnight shifts, try to avoid them in scheduling as much as possible. If the night shift is absolutely unavoidable, establish clear protocols and rules for contacting others in the evening.
For employees taking on the night shift, offer them flexible hours in the morning, so they don’t feel like they are constantly working. Promote rest and sleep as much as possible, and encourage employees to speak up when they feel negatively impacted by on-call rotations.
Balance workloads
For on-call team members, juggling their daily workload alongside on-call responsibilities leads to burnout and fatigue. Offering additional support and flexibility is crucial and goes a long way in improving employee satisfaction and well-being. When team members are on call, create workflows that are less reliant on them during the day to shorten their to-do list.
Other team members may need to balance the workload for those who are on-call, but again, it needs to be fairly distributed. Allow flexibility as much as possible and ask teams to collaborate to develop solutions to reduce bottlenecks overall.
Automate, automate, automate
Much on-call work can be automated to ensure that team members aren’t doing a lot of unnecessary work. For example, monitoring and alert responses can be automated to an extent for common and/or minor issues.
You can use incident response tools to create automated workflows, including runbook automation, to reduce the steps needed to handle an incident on-call. Automation enables team members to work better while on-call without getting tons of alerts for issues that don’t need their attention.
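At its simplest, runbook automation is a lookup from alert type to a scripted response, with a human paged only when no script exists. The Python sketch below illustrates that pattern under stated assumptions: the alert types, the alert dictionary fields, and the two runbook functions are hypothetical placeholders, and a real step would call your own tooling rather than return a string.

```python
def restart_service(alert):
    # Placeholder: a real runbook step would call your orchestration tooling here.
    return f"restarted {alert['service']}"

def clear_disk_cache(alert):
    # Placeholder for a scripted cleanup step.
    return f"cleared cache on {alert['host']}"

# Hypothetical mapping from alert type to automated runbook.
RUNBOOKS = {
    "service_down": restart_service,
    "disk_almost_full": clear_disk_cache,
}

def handle(alert):
    """Run the matching runbook, or hand the alert to a human if none exists."""
    runbook = RUNBOOKS.get(alert["type"])
    if runbook is None:
        return ("page", None)
    return ("automated", runbook(alert))

print(handle({"type": "service_down", "service": "checkout"}))
print(handle({"type": "unknown_error"}))
```

Each time an incident is resolved by hand, the team can ask whether its fix belongs in the mapping so the next occurrence does not need a page.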
Don’t make it one function’s responsibility
Some on-call schedules overly rely on operations engineers or specific teams, which can lead to issues down the line.
One person or a small team can’t reasonably support a large system without struggling, so it’s vital to look for imbalances within on-call rotations. Having an SRE team or supplementing on-call with incident response tools can reduce some of the workload.
How can Blameless help?
With the right tools, on-call doesn’t have to be a source of stress. Blameless facilitates on-call rotations through monitoring, alerts, and giving teams the tools needed to collaborate. Blameless automates incident response by initiating task assignments, centralizing context, and keeping event data in one place.
Blameless manages checklists, runbooks, and more to help teams focus on what’s essential and automate the rest for smoother incident response overall. All the information is documented and saved, including steps taken so that teams have what they need to improve workflows and processes moving forward.
Save real-time data and pinpoint key steps for retrospectives to keep teams collaborative without assigning blame. Learn more about how Blameless improves the on-call experience by scheduling a demo today.