Complete Guide to Service Level Objectives (SLOs) That Work
Wondering what Service Level Objectives (SLOs) are? In this article, we will explain service level objectives and how they relate to SLAs, SLIs, and error budgets.
What is a Service Level Objective?
A Service Level Objective (SLO) is a reliability target that is measured by a Service Level Indicator (SLI) and sometimes serves as a safeguard for a Service Level Agreement (SLA). SLOs represent customer happiness and guide the development team’s velocity.
SLOs quantify customers’ expectations for reliability and start conversations between product and engineering on reliability goals and action plans when the goal is at risk. An example SLO for a service is 95% availability in a rolling 28-day window.
You might feel tempted to set the objective at 100%, but that’s too good to be true. Change brings instability, which will inevitably lead to failure. Not only is 100% reliability an impossible target, but it would also mean that you can't make any changes to the service in production. So expecting perfect reliability is the same as choosing to stop any new features from reaching customers and choosing to stop competing in the market.
The rule of thumb for setting an SLO is to find the point where the customer is happy with the service's reliability.
Increasing the reliability target to no end is not a great business decision. The goal is not perfection, but making customers happy with just the right level of reliability. Once the customers are satisfied with the service, extra reliability offers little value.
Why? Because if customers are happy with your reliability, they’ll want new features instead of more reliability!
What is the Purpose of an SLO?
The purpose of an SLO is to measure customer happiness, protect the company from SLA violations (where an SLA exists), and create a shared understanding of reliability across product, engineering, and business leadership.
What the customers want is just the right balance between reliability and innovation. They want features and want to be able to use the features at any time. Businesses must innovate and release new features to drive revenue and growth while maintaining reliability to retain the customers. Setting an SLO can help teams discuss and agree on how to balance reliability with feature velocity using a data-driven approach.
Service level objectives are most effective when they are owned by someone who can make tradeoffs between feature velocity and reliability. In a small organization, the CTO often takes the role and in a larger organization, the duty would ideally fall upon the product owner or product manager.
What is the Difference between SLA, SLI, and SLO?
The difference between the three terms is simple. The SLI is the indicator that's used to define and measure the SLO. An SLA does not exist for every business, but when there is one, it acts as an outer limit for the SLO: the SLO is set at least as strict as the SLA, and usually stricter.
For example: suppose SLA is your credit card limit, then SLO would be your budget, and SLI would be your actual expense. Now, every person has expenses and should ideally monitor the expenses against a set budget. However, not everyone owns a credit card, but if they do, then their budget would be below the credit card limit. If it exceeds the limit, then there will be some repercussions.
Let's take a look at the SLI and the SLA in more detail.
SLI (Service Level Indicator)
An SLI is used to measure a service’s reliability. It’s a quantifiable metric built from monitoring data of your service. The key to selecting the right indicator is to find out what your customers expect from your service. Additionally, you shouldn't choose too many indicators as it will drive your attention away from the most indicative one or two indicators.
Traditionally, SLIs are either calculated in terms of latency or availability, but you can also use freshness, durability, quality, correctness, and coverage. For different types of systems, the SLI metric is different.
- In user-facing systems, the SLI is usually calculated in terms of availability, latency, and throughput. In simpler terms: can we respond to the request? If so, how long does it take to respond?
- Data processing systems or pipelines usually emphasize throughput or latency. In simpler terms: how much data is processed? And how long does it take the system to progress from data ingestion to completion?
- Storage systems focus on latency, availability, and durability. In simpler terms: how long does it take the system to read or write data? Can the user access data on demand?
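As a rough illustration of how such indicators can be built from monitoring data, here is a minimal Python sketch for a user-facing service. The `Request` record, the rule that any status below 500 counts as a successful response, and the 100 ms latency threshold are all assumptions made for this example, not definitions from this guide.

```python
from dataclasses import dataclass

@dataclass
class Request:
    status_code: int   # HTTP status returned to the caller
    latency_ms: float  # time taken to respond, in milliseconds

def availability_sli(requests):
    """Fraction of requests answered without a server error (assumed: status < 500)."""
    if not requests:
        return None  # no traffic, so the SLI is undefined for this window
    good = sum(1 for r in requests if r.status_code < 500)
    return good / len(requests)

def latency_sli(requests, threshold_ms):
    """Fraction of requests answered within threshold_ms."""
    if not requests:
        return None
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return fast / len(requests)

# Hypothetical sample of four requests
sample = [
    Request(200, 120.0),
    Request(200, 480.0),
    Request(503, 30.0),
    Request(200, 90.0),
]

print(availability_sli(sample))    # 0.75 (3 of 4 requests succeeded)
print(latency_sli(sample, 100.0))  # 0.5  (2 of 4 requests took 100 ms or less)
```

Both indicators are expressed as "good events divided by total events", which makes them easy to compare against a percentage target later.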
SLA (Service Level Agreement)
An SLA is a legal agreement between the service provider and the customer. It includes the minimum reliability target for the service and the financial consequences of not meeting it. The consequences may include a partial refund, discounts, or extra credits. An SLO is an internal objective for your team and is not usually a part of the client contract.
SLOs and SLAs are often confused, but they’re two distinct concepts. Because an SLO is an internal objective, it does not have an associated financial penalty when breached. When there is an SLA, the corresponding SLO is generally tighter. For example, if the SLA defines 99.5% uptime, then the internal objective can be 99.8%. By setting a more stringent internal objective, the SRE team gets a chance to take proactive action before the SLA is breached and the contractual agreement is broken.
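To make the gap between the two targets concrete, the sketch below converts an uptime target into allowed downtime. It assumes a 30-day window and time-based availability, neither of which is specified above, and `allowed_downtime_minutes` is a hypothetical helper name.

```python
def allowed_downtime_minutes(target_percent, window_days=30):
    """Minutes of downtime permitted by an uptime target over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - target_percent) / 100

sla_minutes = allowed_downtime_minutes(99.5)  # contractual limit
slo_minutes = allowed_downtime_minutes(99.8)  # tighter internal objective

print(round(sla_minutes, 1))                # 216.0
print(round(slo_minutes, 1))                # 86.4
print(round(sla_minutes - slo_minutes, 1))  # 129.6
```

Under these assumptions, the team would cross its internal objective after about 86 minutes of downtime, leaving roughly 130 minutes of headroom before the SLA itself is breached.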
SLAs, unlike SLOs, are set by the business development and legal teams. That can be a challenge, as neither team is directly involved in building or running the technology. Involving the IT and DevOps teams as well can therefore increase the probability of creating a functional SLA.
How to Define the SLO of a Service?
Defining SLOs for a service is a long process and the specific approach varies for different companies. We will discuss two approaches below.
Product-centric Approach
We will discuss the step-by-step approach to defining service level objectives. There are no hard and fast rules concerning the order of the steps. Some companies start by defining user journeys and formulating the SLO accordingly, whereas others start with metrics and hypothesize user journeys later to improve and refine an existing SLO.
- Define user journey with the product team
- Identify the key services that are on the user journey and select the best SLI type
- Define SLI
- Define SLO
- Create an error budget policy
- Monitor and report on the SLO
- Periodically re-evaluate the SLO and make changes as needed
Engineering-centric Approach
The second, engineering-centered approach toward defining SLOs is:
- Define SLI
- Define SLO
- Monitor and report based on SLO
Many companies start with the second approach and encounter difficulties with wider adoption. Error budget policies are crucial as they make the SLOs actionable. We also advise teams to circle back to the product and outline critical user journeys.
Who Defines the SLO?
Defining an SLO is a collaborative process driven by the SRE team, but it requires input from multiple stakeholders across the organization.
The key stakeholders involved are:
- Product Owners ranging from product managers to business analysts. The product owners try to anticipate the customer needs and communicate them to the development and SRE teams. Ideally, they contribute to the definition of SLO to reflect customer needs.
- SRE & ops teams consist of DevOps, ITSM & problem management, and infrastructure engineers who lay the groundwork for the development team. They help ensure that the SLO is realistic, sustainable, and free of excessive toil that causes burnout.
- The development team comprises the software teams that are developing the product. They can weigh in on the SLOs and negotiate a relaxation if reliability work is slowing down the release velocity.
- Both internal and external users and stakeholders fall under the Customer umbrella. They contribute to the SLO definition via feedback meetings, customer complaints, tweets, and the SLA.
Once the SLO is determined, it's documented by the authors, reviewers (who check for technical accuracy), and approvers (who weigh in based on business considerations).
To make service level objectives work for various parts of your organization, each team would ideally agree that the SLO is a reasonable approximation of user experience and use it as the principal driver for decision making. Not meeting the SLOs usually has well-documented consequences that redirect the engineering efforts towards improving reliability. To enforce the consequences, the operations team requires executive support.
At the end of the day, SLOs align incentives but they’re not enough on their own. In a heavily siloed organization, it’s much harder to reach an agreement. The best chances of success are when there’s a shared sense of responsibility between the developers and the SRE/Ops team. Developers feel their responsibility towards making the service reliable and the SRE/Ops team feels the responsibility to help the developers actively release new features.
What are Some Characteristics of a Well-thought-out SLO?
A good service level objective must align with the company’s specific business needs. For example, if all your customers are in the same time zone, and work 9-5, then availability outside of the active hours wouldn’t matter to the customer. As the customer will not try to access the service, they won’t be unhappy if it breaks during their inactive hours.
Secondly, according to Google’s paper on meaningful reliability:
“A good service availability metric should be meaningful, proportional, and actionable.”
Meaningful, in this context, means it captures user experiences. Proportional means that any change in the metric must be proportional to a variation in user-perceived availability. Finally, actionable means it provides system owners an insight into why availability was low over a specific period of time.
Lastly, a solid SLO must be realistic. The objective shouldn’t be too far off from how the services have been performing so far, and it’s best decided with the team’s resource constraints kept in mind. You don’t want to aim for the stars and demoralize your team with an unrealistic objective.
What are Some Challenges and Pitfalls of Creating SLOs?
Creating service level objectives can be quite challenging especially in the beginning. Everyone wants 100% reliability, which is unrealistic. It means that the service has zero error budget (no tolerance towards failure), which is a drawback in itself.
Another common pitfall is starting with way too many SLOs at an earlier stage. Given the complexity of most systems, starting small and iterating over time is the best course of action. You don’t want to make the system more complex than it needs to be. Only the most critical services need to be measured, and would ideally have only two to six SLIs.
Not spelling out SLOs in plain and simple language is another common pitfall. Since SLOs are mostly internal objectives, their main job is to help the development and SRE/Ops teams balance feature development with reliability work. They should ideally be simple enough that anyone on those teams can understand them.
Finally, creating an SLO is a collaborative process that requires input and buy-in from everyone including the leadership, development team, SRE/Ops team, product owners, etc. To make the SLOs work, all relevant teams and individuals must agree that they are reasonable and can be used as the basis for decision-making. Also, consequences for not meeting SLOs can hardly be enforced without executive support.
How are SLOs Related to the Error Budget?
Service level objectives are used to calculate the error budget, a tool used to balance innovation with reliability. The error budget defines the acceptable level of unreliability that a service can afford without impacting customer happiness.
As long as the service remains within the error budget, developers can take more risks. On the other hand, when the error budget starts to dry up, the developers would likely need to make safer choices.
Here’s how you can calculate the error budget using SLOs:
Error Budget = 1 - SLO
For example, if the SLO is 97%:
Error Budget = 1 - 97% = 3%
If your service receives 100,000 requests in four weeks and the SLO is 97%, then the error budget is 3% of 100,000, or 3,000 errors in four weeks.
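The same arithmetic can be written as a small Python sketch. The function names are hypothetical, and `Decimal` is used only to avoid floating-point rounding surprises with targets such as 99.9%. For 100,000 requests and a 97% SLO, the budget works out to 3% of the traffic, which is 3,000 failed requests.

```python
from decimal import Decimal
import math

def error_budget_percent(slo_percent):
    """Error budget as a percentage: 100% minus the SLO target."""
    return Decimal("100") - Decimal(str(slo_percent))

def allowed_errors(total_requests, slo_percent):
    """Number of failed requests the budget permits, rounded down."""
    budget = error_budget_percent(slo_percent)
    return math.floor(Decimal(total_requests) * budget / Decimal("100"))

print(error_budget_percent(97))       # 3
print(allowed_errors(100_000, 97))    # 3000
print(allowed_errors(100_000, 99.9))  # 100
```

Note how quickly the budget shrinks as the target rises: moving from 97% to 99.9% cuts the allowance from 3,000 errors to 100 for the same traffic.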
SLOs Monitoring - Who Does It and How?
The best way to monitor an SLO is through error budget policies. The SRE team sets up the monitoring system to send an alert when a particular percentage of the error budget is consumed. For example, send an alert if 75% (or 50%) of the error budget is consumed over a 7-day period.
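A minimal sketch of such an alert rule is shown below. It assumes the monitoring system can report how many errors were observed in the last 7 days and how many errors the budget allows in total; the function names and the 3,000-error budget are hypothetical values chosen for the example.

```python
def budget_consumed(errors_observed, errors_allowed):
    """Fraction of the error budget used so far (1.0 means fully spent)."""
    if errors_allowed <= 0:
        raise ValueError("errors_allowed must be positive")
    return errors_observed / errors_allowed

def should_alert(errors_observed, errors_allowed, threshold=0.75):
    """True when consumption reaches the alert threshold (default 75%)."""
    return budget_consumed(errors_observed, errors_allowed) >= threshold

ERRORS_ALLOWED = 3000  # hypothetical budget for the SLO window

print(should_alert(2400, ERRORS_ALLOWED))       # True  (80% of the budget used)
print(should_alert(1500, ERRORS_ALLOWED))       # False (50% used, below the 75% threshold)
print(should_alert(1500, ERRORS_ALLOWED, 0.5))  # True  (50% used, threshold lowered to 50%)
```

In practice the threshold is a policy decision: a lower value gives earlier warning at the cost of more alerts.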
Monitoring SLOs is mainly the job of the SRE team. They collect the SLI metrics and work with other teams to define the SLO. If the SLO is at risk, the teams then decide if any action is needed and figure out how to meet the SLO. The SRE team collaborates with development and product teams to ensure the targets and policies are agreed upon.
The process starts with monitoring and measuring the service’s SLIs over time. The SLIs are then compared against the SLOs. If an action is required, then the SRE team figures out what steps must be taken to meet the target. Without SLOs and error budget policies, the SRE team will have no way to decide whether and when they should take an action.
How Often is the SLO Evaluated?
Running a service with an SLO is an adaptive and iterative process. In a 12-month period, a lot will change. New features might not be covered, customer expectations might change, or there might be a change in the company's risk-reward profile.
It's important to reevaluate the SLO regularly because it's no good if you're meeting your SLO but your customers are still unhappy and complaining on Twitter or Zendesk. Review your SLO a few months after setting it, and follow up with a similar review every six to twelve months.
There are no hard and fast rules established regarding the SLO evaluation. Depending on the product, expected usage, and managing team, SLOs can be different for each organization. You can consider all user groups such as mobile users, desktop users, and people from various geographic locations and modify the SLOs accordingly.
Evaluate and refine the targets until you locate the optimal point. For example, if your team is continuously performing way above the SLO, then:
- You can tighten up the SLO and increase service reliability (and maybe tell your customers about your superior reliability as a competitive advantage), or
- Capitalize on the unused error budget by investing in product development or experiments.
Whereas, if your team is continuously struggling to keep up with the SLOs, then:
- You can bring the SLO down to a more manageable level.
- Invest in stabilizing the product before rolling out new features.
SLOs are continuously evaluated, and are all about learning, innovating, and starting over!
What are the Consequences of Not Meeting an SLO?
The consequences of not meeting the SLOs usually involve code freezes, slowing down development, and shifting more resources towards bug fixes.
What’s important here is that the consequences of not meeting the SLOs are agreed upon by the product team, developers, and SREs. SREs can also use the error budget policies to alert relevant leaders as soon as the SLO is at risk. For example, if the SLO is 95%, then alerting the leaders when measured reliability drops to 97.5% (the point at which half of the error budget is consumed) can help them take action and prioritize reliability vs. feature velocity accordingly.
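The 97.5% figure in the example corresponds to half of the error budget being used up: a 95% SLO leaves a 5% budget, and half of that is 2.5%. A small sketch of that calculation follows, with `alert_level` as a hypothetical helper name.

```python
def alert_level(slo_percent, budget_fraction):
    """Measured reliability (in percent) at which the given fraction
    of the error budget has been consumed."""
    error_budget = 100 - slo_percent
    return 100 - budget_fraction * error_budget

print(alert_level(95, 0.5))  # 97.5 (half of the 5% budget is gone)
print(alert_level(95, 1.0))  # 95.0 (budget fully spent, SLO at its limit)
```

Choosing the fraction is part of the error budget policy, so it should be agreed upon by the same teams that agree on the consequences.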
How can Blameless Help with the Process?
Defining service level objectives can be challenging for any organization, but Blameless can make the process seamless with its SLO Manager. You can use it to create user journeys, define SLIs and SLOs for your service, monitor and report on SLOs, and move towards error budget-based alerting and prioritization. Sign up for a free trial today.